# Dive into data: Finding and Preparing Data

##### The first practical part is about data.

Computer Vision Winterschool 2023 Example:
Image classifier that assigns images to one of two categories based on labeled training data - __estimating the date of origin of the photos (19th or 20th century)__

To train such a model we need __photos with corresponding labels__, that is, photos whose date of origin is known (in our case 19th or 20th century).

### Goal of this Session:
- get the images
- make a custom PyTorch dataset

# 1. download .csv


# 2. get comfortable with it 


# 3. what do we need?


# 4. how can we get the photos (pseudocode)
Cheat sheet


# Example:

We now know that we need to filter the entries by type and date to get the photos that are relevant to us.

There are many ways to do this. A popular choice is pandas, a Python library for processing, analyzing and displaying data.
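As a minimal sketch of this kind of filtering, the cell below builds a tiny table and keeps only the photographs with a known start date. The column names match those used later in this notebook; the rows themselves are invented for illustration.

```python
import pandas as pd

# Toy data, invented for illustration; the real table is loaded from the .csv below.
toy = pd.DataFrame({
    'Type': ['Photograph', 'Postcard', 'Photograph'],
    'Date (From)': [1895, 1910, None],
})

# Boolean mask: keep rows whose type is 'Photograph' and whose start date is known.
mask = (toy['Type'] == 'Photograph') & toy['Date (From)'].notna()
photos = toy[mask]
print(photos)
```

The same pattern (build a boolean mask, then index the dataframe with it) is what the cells below apply step by step to the real data.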

In [None]:
import pandas as pd
import requests
import time
import os
import numpy as np

In [None]:
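#print the current working directory (os.system("pwd") would only return the exit status of the shell command)
print(os.getcwd())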

In [None]:
#read csv to dataframe
df = pd.read_csv('../vif.csv', encoding='utf8')
#print(df)

In [None]:
#strip all string values of the dataframe to remove unwanted whitespaces and newlines

df_obj = df.select_dtypes(['object'])
df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip())

In [None]:
#Replacing blank values (white space) with NaN in pandas

df = df.replace(r'^\s*$', np.nan, regex=True)
#print(df)

In [None]:
#get all different values of the column 'Type'
print(df['Type'].unique())

In [None]:
#count type
print(len(df[df['Type'] == 'Photograph']))

In [None]:
#copy the rows where the column 'Type' is equal to 'Photograph' to a new dataframe
#(.copy() avoids a SettingWithCopyWarning when we modify df2 later)

df2 = df[df['Type'] == 'Photograph'].copy()

In [None]:
#delete entries where the column 'Date (From)' is empty

df2 = df2.dropna(subset=['Date (From)'])

In [None]:
#convert the year columns to nullable integers
#converting directly with .astype('Int32') fails with "TypeError: object cannot be converted to an IntegerDtype",
#so we go through float first, see
#https://stackoverflow.com/questions/60024262/error-converting-object-string-to-int32-typeerror-object-cannot-be-converted

df2['Date (From)'] = df2['Date (From)'].astype('float').astype('Int32')
df2['Date (To)'] = df2['Date (To)'].astype('float').astype('Int32')
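To see what the two-step conversion does, here is a small sketch on invented values. It assumes the year columns arrive as strings (for example `'1895.0'`) mixed with missing values, which is one situation in which a direct conversion to `Int32` fails; the actual content of the .csv may differ.

```python
import numpy as np
import pandas as pd

# Invented sample: years stored as strings, plus one missing value.
years = pd.Series(['1895.0', '1910', np.nan], dtype='object')

# First parse the strings as floats, then convert to the nullable integer type.
converted = years.astype('float').astype('Int32')
print(converted.dtype)
print(converted.tolist())
```

The missing value survives as `<NA>`, so later comparisons such as `< 1900` work on the integer column without dropping rows by accident.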

In [None]:
#delete entries whose date range spans the turn of the century:
#the column 'Date (From)' is less than 1900 and the column 'Date (To)' is greater than or equal to 1900

df2 = df2.drop(df2[(df2['Date (From)'] < 1900) & (df2['Date (To)'] >= 1900)].index)

#without the integer conversion above this raises
#TypeError: '<' not supported between instances of 'str' and 'int'

In [None]:
#delete entries where the column 'Date (From)' is less than 1900 and column 'Date (Appendix)' is 'after'

df2 = df2.drop(df2[(df2['Date (From)'] < 1900) & (df2['Date (Appendix)'] == 'after')].index)


In [None]:
#delete entries where the column 'Date (From)' is greater than 1900 and column 'Date (Appendix)' is 'before'

df2 = df2.drop(df2[(df2['Date (From)'] > 1900) & (df2['Date (Appendix)'] == 'before')].index)


In [None]:
#delete entries where the column 'Date (From)' is greater than or equal to 2000

df2 = df2.drop(df2[(df2['Date (From)'] >= 2000)].index)


It turns out that we only need the column 'Image 1'.
Now that we have filtered the entries, let's think about the best way to organize the photos locally for our PyTorch dataset.

Basically, we have many options. Two common approaches are: 

- one directory per GT (ground truth, i.e. the label)
- a single directory plus a .csv file that lists the names of the photos and the associated GT
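To make the first option concrete, here is a minimal sketch that walks a directory-per-label layout and collects `(path, label)` pairs, which is the kind of index a custom dataset would typically hold. The helper name `index_by_label` and the demo files are hypothetical; the sketch uses only the standard library and works on a temporary directory it creates itself.

```python
import os
import tempfile

def index_by_label(root):
    """Return a sorted list of (image_path, label) pairs, one label per subdirectory."""
    samples = []
    for label in sorted(os.listdir(root)):
        label_dir = os.path.join(root, label)
        if not os.path.isdir(label_dir):
            continue  # ignore stray files next to the label directories
        for name in sorted(os.listdir(label_dir)):
            if name.lower().endswith('.jpg'):
                samples.append((os.path.join(label_dir, name), label))
    return samples

# Demo on a throwaway directory with empty placeholder files.
demo_root = tempfile.mkdtemp()
for label, name in [('19', 'a.jpg'), ('20', 'b.jpg'), ('20', 'c.jpg')]:
    os.makedirs(os.path.join(demo_root, label), exist_ok=True)
    open(os.path.join(demo_root, label, name), 'wb').close()

samples = index_by_label(demo_root)
print([(os.path.basename(path), label) for path, label in samples])
```

With this layout the label is encoded in the directory name, so no separate .csv file is needed; the second option trades that convenience for a flat directory and an explicit table.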

In [None]:
#add a new empty column 'GT' to the dataframe

df2['GT'] = ''

In [None]:
#set the column 'GT' to '19' if any of the following holds:
# - 'Date (To)' is less than or equal to 1900
# - 'Date (From)' is less than or equal to 1900 and 'Date (Appendix)' is 'before'
# - 'Date (From)' is less than 1900 and both 'Date (Appendix)' and 'Date (To)' are empty

ends_by_1900 = df2['Date (To)'] <= 1900
before_1900 = (df2['Date (From)'] <= 1900) & (df2['Date (Appendix)'] == 'before')
only_from = (df2['Date (From)'] < 1900) & df2['Date (Appendix)'].isnull() & df2['Date (To)'].isnull()
df2.loc[ends_by_1900 | before_1900 | only_from, 'GT'] = '19'

In [None]:
#set the column 'GT' to '20' if the column 'Date (From)' is greater than or equal to 1900

df2.loc[(df2['Date (From)'] >= 1900), 'GT'] = '20'
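The two labelling cells can be hard to follow as boolean masks, so here is the same rule written as a plain Python function for a single entry. This is a sketch of the logic as it reads in the cells above, not part of the pipeline; the function name `century_label` is hypothetical and `None` stands in for an empty field.

```python
def century_label(date_from, date_to=None, appendix=None):
    """Mirror the two labelling steps for one entry; returns '19', '20' or ''."""
    label = ''
    # Step 1: conditions for the 19th century.
    if (
        (date_to is not None and date_to <= 1900)
        or (date_from <= 1900 and appendix == 'before')
        or (date_from < 1900 and appendix is None and date_to is None)
    ):
        label = '19'
    # Step 2: applied afterwards, so it overwrites step 1 when both match.
    if date_from >= 1900:
        label = '20'
    return label

print(century_label(1880, 1890))           # range ends before 1900
print(century_label(1950))                 # starts in the 20th century
print(century_label(1900, 1900))           # both steps match, step 2 wins
print(century_label(1880, None, 'circa'))  # no rule applies, label stays empty
```

Two things are worth noticing: the order of the steps matters for entries dated exactly 1900, and some entries can end up without any label.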

In [None]:
#count entries
print(len(df2))

In [None]:
#export to csv

df2.to_csv('../photographs.csv', index=False, encoding='utf8')

In [None]:
#download the image from the column 'Image 1' for every row and save it as '<id>.jpg',
#where <id> is the value of the column 'Permalink' without the prefix 'https://gams.uni-graz.at/archive/objects/o:';
#the image goes to the folder '../data/century_ds/19/' if the column 'GT' is equal to '19'
#and to the folder '../data/century_ds/20/' if the column 'GT' is equal to '20'

#create directories if they don't exist
os.makedirs('../data/century_ds/19', exist_ok=True)
os.makedirs('../data/century_ds/20', exist_ok=True)

#.head(10) limits the download to the first 10 rows (in this example all of them end up in the directory 19);
#remove it if you want to download all the pictures
for index, row in df2.head(10).iterrows():
    if row['GT'] not in ('19', '20'):
        continue #skip rows that did not receive a label
    img_name = row['Permalink'].split(':')[2]

    r = requests.get(row['Image 1'], timeout=30)
    r.raise_for_status()
    with open(os.path.join('../data/century_ds', row['GT'], img_name + '.jpg'), 'wb') as f:
        f.write(r.content)
    #add a delay of 0.2 seconds to avoid a 429 (Too Many Requests) error
    time.sleep(0.2)
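The naming scheme used in the download loop can be checked without any network access. The sketch below isolates it in a small helper; the function name `target_path` and the object id `example.123` are made up for illustration, while the URL prefix and the output directory are the ones used above.

```python
import os

def target_path(permalink, gt, root='../data/century_ds'):
    """Build the local file path for an image from its permalink and its label."""
    # The permalink contains two colons ('https:' and 'o:'); the part after the second one is the id.
    img_name = permalink.split(':')[2]
    return os.path.join(root, gt, img_name + '.jpg')

# Hypothetical permalink, for illustration only.
example = target_path('https://gams.uni-graz.at/archive/objects/o:example.123', '19')
print(example)
```

Keeping the path logic in one place makes it easy to verify that every image lands in the directory that matches its label before starting a long download.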