# Data Collection

> In this notebook we load and check the [WikiArt | All Images](https://www.kaggle.com/datasets/antoinegruson/-wikiart-all-images-120k-link) dataset from Kaggle. We use this dataset throughout these notebooks as we clean it, perform EDA, and build models on it.

---

## Imports

In [1]:
import pandas as pd

## Reading in Data

> We found this dataset, `wikiart_scraped.csv`, on Kaggle, where it was used in a competition to predict whether two paintings were by the same artist. That is very different from our goal of predicting painting by image, so we felt it was permitted for use. It contains the columns `['Style', 'Artwork', 'Artist', 'Date', 'Link']`, which will be useful in EDA and from which some features can be engineered.

In [2]:
df = pd.read_csv('../data/wikiart_scraped.csv')
df

| Unnamed: 0 | Style | Artwork | Artist | Date | Link |
| --- | --- | --- | --- | --- | --- |
| 0 | Early-Dynastic | Narmer Palette | Ancient Egypt | 3050 BC | https://uploads3.wikiart.org/00265/images/anci... |
| 1 | Early-Dynastic | Box Inlay with a Geometric Pattern | Ancient Egypt | 3100-2900 BC | https://uploads2.wikiart.org/00244/images/anci... |
| 2 | Old-Kingdom | Khafre Enthroned | Ancient Egypt | 2570 BC | https://uploads2.wikiart.org/00305/images/anci... |
| 3 | Middle-Kingdom | Stele of the Serpent King (Stela of Djet) | Ancient Egypt | 3000 BC | https://uploads7.wikiart.org/00305/images/anci... |
| 4 | Middle-Kingdom | Laden Donkeys and Ploughing, Tomb of Djar | Ancient Egypt | 2060-2010 BC | https://uploads8.wikiart.org/00244/images/anci... |
| ... | ... | ... | ... | ... | ... |
| 124165 | Street-Photography | Portrait of the corn stalk | Alfred Freddy Krupa | 2019 | https://uploads5.wikiart.org/00241/images/alfr... |
| 124166 | Street-Photography | The other side of life | Alfred Freddy Krupa | 2019 | https://uploads7.wikiart.org/00241/images/alfr... |
| 124167 | Street-Photography | The bonfire during construction | Alfred Freddy Krupa | 2019 | https://uploads7.wikiart.org/00242/images/alfr... |
| 124168 | Street-Photography | Limpidity | Alfred Freddy Krupa | 2019 | https://uploads7.wikiart.org/00248/images/alfr... |
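
> Beyond eyeballing the preview, a few basic checks can be run on a freshly loaded frame. The sketch below is only an illustration: `summarize_frame` is a hypothetical helper that is not part of the original notebook, and the toy rows are invented so that the snippet runs without the CSV file.

```python
import pandas as pd


def summarize_frame(dataframe, link_column="Link"):
    """Return basic health checks for a loaded DataFrame (hypothetical helper)."""
    return {
        "n_rows": len(dataframe),
        "columns": list(dataframe.columns),
        # Number of missing values in each column
        "missing": dataframe.isna().sum().to_dict(),
        # Rows whose link repeats the link of an earlier row
        "duplicate_links": int(dataframe[link_column].duplicated().sum()),
    }


# Toy rows for illustration only; they are not taken from the dataset
toy = pd.DataFrame({
    "Style": ["Cubism", "Cubism", "Fauvism"],
    "Artist": ["A", None, "B"],
    "Link": [
        "https://example.com/1.jpg",
        "https://example.com/1.jpg",
        "https://example.com/2.jpg",
    ],
})

summary = summarize_frame(toy)
print(summary)
```

> On the real data the same call would be `summarize_frame(df)`, assuming the `Link` column shown above.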


## Dropping Bad URLs

> During data cleaning and image scraping we found that some images have placeholder or dead links. We drop those rows here, before saving the data as `raw_data.csv`, so that they never reach the image scraper.

In [18]:
bad_URLs = [
'https://uploads2.wikiart.org/images/henri-rousseau/view-of-the-bridge-at-sevres-and-the-hills-at-clamart-st-cloud-and-bellevue-1908.jpg',
'https://uploads8.wikiart.org/images/jean-arp/abstract-composition.jpg',
'https://uploads2.wikiart.org/images/franz-marc/sleeping-animals-1913.jpg',
'https://uploads5.wikiart.org/images/el-lissitzky/central-park-of-culture-and-leisure-sparrow-hills.jpg',
'https://uploads1.wikiart.org/images/juan-gris/glass-and-carafe-1917.jpg',
'https://uploads6.wikiart.org/images/juan-gris/landscape-at-beaulieu-1918.jpg',
'https://uploads8.wikiart.org/images/pablo-picasso/untitled-1920-2.jpg',
'https://uploads0.wikiart.org/images/juan-gris/the-open-window-1921.jpg',
'https://uploads0.wikiart.org/images/georgia-o-keeffe/special-no-32.jpg',
'https://uploads.wikiart.org/Content/wiki/img/lazy-load-placeholder.png'
]

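def drop_dead_rows(dataframe, bad_URLs):
    # Keep only the rows whose 'Link' is not in the list of known bad URLs
    dataframe = dataframe[~dataframe['Link'].isin(bad_URLs)]
    # Renumber the index so it is contiguous after the drop
    return dataframe.reset_index(drop=True)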

df = drop_dead_rows(df, bad_URLs)
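
> A quick sanity check after a drop like this is to confirm that no flagged link survives and that the index is contiguous again. The sketch below applies the same filtering idea as `drop_dead_rows` to a toy frame; the frame, the `toy_bad_urls` list, and the `example.com` URLs are placeholders invented for illustration.

```python
import pandas as pd

# Toy data for illustration only; the URLs are placeholders, not dataset links
toy = pd.DataFrame({
    "Artwork": ["a", "b", "c", "d"],
    "Link": [
        "https://example.com/good-1.jpg",
        "https://example.com/dead.jpg",
        "https://example.com/good-2.jpg",
        "https://example.com/placeholder.png",
    ],
})
toy_bad_urls = [
    "https://example.com/dead.jpg",
    "https://example.com/placeholder.png",
]

# Same idea as drop_dead_rows: keep the rows whose link is not flagged
cleaned = toy[~toy["Link"].isin(toy_bad_urls)].reset_index(drop=True)
n_dropped = len(toy) - len(cleaned)

# No flagged link should survive, and the index should run 0..n-1
assert not cleaned["Link"].isin(toy_bad_urls).any()
assert list(cleaned.index) == list(range(len(cleaned)))
print(f"Dropped {n_dropped} of {len(toy)} rows")
```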

## Saving Collected Data

> The data is saved to `raw_data.csv` and cleaned later in `02_Data_Cleaning`.

In [17]:
df.to_csv('../data/raw_data.csv', index=False)
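
> Passing `index=False` matters on the next read. If the row labels are written to the file, `pd.read_csv` brings them back as an extra `Unnamed: 0` column, like the one in the preview of `wikiart_scraped.csv` above. The sketch below shows the difference on a toy frame written to a temporary directory; the frame and file names are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy frame for illustration only
toy = pd.DataFrame({
    "Style": ["Cubism"],
    "Link": ["https://example.com/1.jpg"],
})

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, "with_index.csv")
    without_index = os.path.join(tmp, "without_index.csv")

    toy.to_csv(with_index)                   # default index=True writes the row labels
    toy.to_csv(without_index, index=False)   # row labels are left out

    cols_with_index = list(pd.read_csv(with_index).columns)
    cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)     # ['Unnamed: 0', 'Style', 'Link']
print(cols_without_index)  # ['Style', 'Link']
```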