# Capstone Project: Data Download and Data Cleaning

## Chosen Source of Data

- The caption generator is trained on captions and images from Atlas Obscura.
- Atlas Obscura is an online magazine and travel company. It was founded in 2009 by author Joshua Foer and documentary filmmaker/author Dylan Thuras. It catalogs unusual and obscure travel destinations via user-generated content.
- https://www.atlasobscura.com/places

## Data Download

**Method references:**

- https://www.analyticsvidhya.com/blog/2017/07/web-scraping-in-python-using-scrapy/
- https://www.datacamp.com/community/tutorials/making-web-crawlers-scrapy-python
- https://letslearnabout.net/tutorial/scrapy-tutorial/python-scrapy-tutorial-for-beginners-01-creating-your-first-spider/
- https://youtu.be/quMUjys9BcU
- https://towardsdatascience.com/scrape-multiple-pages-with-scrapy-ea8edfa4318

**Captions and photograph URLs were scraped with Scrapy and saved to a local CSV file.**

In [51]:
import numpy as np
import pandas as pd
import os
import requests
import time

In [52]:
df = pd.read_csv("../Data/atlas_edits.csv")

In [53]:
df

Unnamed: 0,url,description,response
0,https://assets.atlasobscura.com/media/W1siZiIs...,A tangible and sobering reminder of an atrocit...,200
1,https://assets.atlasobscura.com/media/W1siZiIs...,"Hidden in the forest are the crumbling, graffi...",200
2,https://assets.atlasobscura.com/media/W1siZiIs...,The Great Fire of London and the regulation of...,200
3,https://assets.atlasobscura.com/media/W1siZiIs...,This beautifully restored fort kept the city s...,200
4,https://assets.atlasobscura.com/media/W1siZiIs...,A former Roman town that was once the home of ...,200
...,...,...,...
19470,https://assets.atlasobscura.com/media/W1siZiIs...,A varied collection of brains in alcohol.,200
19471,https://assets.atlasobscura.com/media/W1siZiIs...,Dental instruments from the illustrious histor...,200
19472,https://assets.atlasobscura.com/media/W1siZiIs...,An ancient rock engraving shows Aboriginals us...,200
19473,https://assets.atlasobscura.com/media/W1siZiIs...,A vision of beautiful Hindu architecture in th...,200


In [54]:
df.isnull().sum()

url            0
description    0
response       0
dtype: int64

In [55]:
df["response"].unique()

array([200])

In [56]:
# Run only once, on the first pass, to download every image (kept commented out afterwards).

# print('Beginning file download.')

# result_df = pd.DataFrame(index = df.index, columns = ["Status", "Content-Type"])
# for i in range(0,len(df)):
#     url = df.loc[i,"url"]
#     r = requests.get(url)
#     # Name each file after its row label so images can be matched back to captions.
#     with open("../Data/Atlas_Images/" + str(df.index[i]) + ".jpg", 'wb') as f:
#         f.write(r.content)
#     print(f"Download of file number {df.index[i]} complete.")
#     print(r.status_code)
#     print(r.headers['content-type'])
#     result_df.loc[i,"Status"] = r.status_code
#     result_df.loc[i,"Content-Type"] = r.headers['content-type']
#     time.sleep(5)  # pause between requests

# result_df
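
The loop above writes every response body to disk, whatever the status code or content type. A minimal sketch of a more defensive variant follows. The function name `download_images` and its `get` parameter are hypothetical: `get` stands for any callable shaped like `requests.get`, so the loop can be tried without network access.

```python
import os
import time


def download_images(urls, out_dir, get, delay=5.0):
    """Save each URL as <out_dir>/<key>.jpg and return a per-key result log.

    `urls` maps a row label to an image URL, e.g. df["url"].to_dict().
    `get` is a hypothetical injection point: any callable that takes a URL
    and returns an object with status_code, headers and content attributes.
    """
    os.makedirs(out_dir, exist_ok=True)
    results = {}
    for key, url in urls.items():
        response = get(url)
        content_type = response.headers.get("content-type", "")
        ok = response.status_code == 200 and content_type.startswith("image/")
        if ok:
            # Only write the file when the server actually returned an image.
            with open(os.path.join(out_dir, f"{key}.jpg"), "wb") as f:
                f.write(response.content)
        results[key] = {
            "status": response.status_code,
            "content_type": content_type,
            "saved": ok,
        }
        time.sleep(delay)  # pause between requests
    return results
```

With the notebook's imports, a call could look like `download_images(df["url"].to_dict(), "../Data/Atlas_Images/", requests.get)`. The returned dictionary plays the role of `result_df` and also records which rows produced no file.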

## Data Exploration

### Remove Duplicate Captions and Images

In [57]:
print(len(df))
print(len(df["url"].unique()))

19475
19460


In [58]:
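# duplicated() flags every repeat of a URL after its first occurrence,
# so the first copy of each image is kept.
duplicates = df[df.duplicated(subset=['url'])]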

In [59]:
deletion_list = [str(i) + ".jpg" for i in list(duplicates.index)]
deletion_list

['7758.jpg',
 '9270.jpg',
 '9971.jpg',
 '12527.jpg',
 '12617.jpg',
 '12618.jpg',
 '13373.jpg',
 '13409.jpg',
 '13410.jpg',
 '14165.jpg',
 '19458.jpg',
 '19462.jpg',
 '19465.jpg',
 '19468.jpg',
 '19469.jpg']

In [60]:
df.drop(index = duplicates.index, inplace = True)

In [61]:
# Delete the image files that belong to the dropped duplicate rows.
directory = "/Users/yannusinovich/Documents/GA-DSI-Tor/DSI-7-lessons-local/Capstone/Data/Atlas_Images/"
for filename in os.listdir(directory):
    if filename in deletion_list:
        os.remove(os.path.join(directory, filename))
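
The cell above scans the whole image directory and checks every filename against the deletion list. As a rough alternative, the sketch below goes straight to the listed files and reports which ones it removed. The helper name `remove_files` is hypothetical and is not part of the original notebook.

```python
from pathlib import Path


def remove_files(directory, filenames):
    """Delete the named files from `directory`; return the names actually removed."""
    removed = []
    for name in filenames:
        path = Path(directory) / name
        if path.is_file():  # skip names that are already gone
            path.unlink()
            removed.append(name)
    return removed
```

Calling `remove_files(directory, deletion_list)` would then return the subset of `deletion_list` that existed on disk, which makes a rerun of the cell harmless.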

### Explore Caption Lengths

In [62]:
# Approximate word count: the number of spaces in the caption, plus one.
df['total_words'] = df['description'].str.count(' ') + 1

In [63]:
max(df["total_words"])

36

In [64]:
min(df["total_words"])

1

In [65]:
np.mean(df["total_words"])

14.085457348406988

In [66]:
df["total_words"].value_counts(normalize = True)

14    0.089209
13    0.088592
15    0.087770
12    0.084635
16    0.081706
17    0.074666
11    0.070812
18    0.061048
10    0.060123
9     0.049075
19    0.047379
8     0.039054
20    0.034943
7     0.028520
21    0.025077
22    0.017780
6     0.016804
5     0.009764
23    0.009712
24    0.006321
25    0.004728
4     0.004111
26    0.002364
3     0.001696
27    0.001644
28    0.000617
2     0.000462
29    0.000411
30    0.000360
31    0.000206
1     0.000154
32    0.000103
36    0.000051
35    0.000051
34    0.000051
Name: total_words, dtype: float64

In [67]:
df[df["total_words"] <= 1]

Unnamed: 0,url,description,response,total_words
10314,https://assets.atlasobscura.com/media/W1siZiIs...,Taumatawhakatangihangakoauauo\ntamateaturipuka...,200,1
10359,https://assets.atlasobscura.com/media/W1siZiIs...,.,200,1
15396,https://assets.atlasobscura.com/media/W1siZiIs...,.,200,1
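
Counting spaces gives a caption of "." a length of 1 and does not treat a newline as a word break, which is how the three rows above end up with `total_words` equal to 1. A possible alternative, sketched here with a hypothetical helper `count_words`, splits on any whitespace and ignores tokens that contain no letters or digits.

```python
def count_words(caption):
    """Count whitespace-separated tokens that contain at least one letter or digit."""
    return sum(
        1
        for token in caption.split()  # split() with no argument splits on any whitespace
        if any(ch.isalnum() for ch in token)
    )


print(count_words("A varied collection of brains in alcohol."))  # 7
print(count_words("."))  # 0
```

Applied to the notebook's data this would be `df["description"].apply(count_words)`. Under this count the punctuation-only captions get 0 words, so the threshold used for filtering may need to be reconsidered.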


### Delete Captions Too Short to Train the Model, and Their Corresponding Images

In [68]:
shorts = df[df["total_words"] <= 1]

In [69]:
deletion_list_2 = [str(i) + ".jpg" for i in list(shorts.index)]
deletion_list_2

['10314.jpg', '10359.jpg', '15396.jpg']

In [70]:
df.drop(index = shorts.index, inplace = True)

In [71]:
# Delete the image files that belong to the dropped short-caption rows.
directory = "/Users/yannusinovich/Documents/GA-DSI-Tor/DSI-7-lessons-local/Capstone/Data/Atlas_Images/"
for filename in os.listdir(directory):
    if filename in deletion_list_2:
        os.remove(os.path.join(directory, filename))

In [72]:
df

Unnamed: 0,url,description,response,total_words
0,https://assets.atlasobscura.com/media/W1siZiIs...,A tangible and sobering reminder of an atrocit...,200,16
1,https://assets.atlasobscura.com/media/W1siZiIs...,"Hidden in the forest are the crumbling, graffi...",200,23
2,https://assets.atlasobscura.com/media/W1siZiIs...,The Great Fire of London and the regulation of...,200,16
3,https://assets.atlasobscura.com/media/W1siZiIs...,This beautifully restored fort kept the city s...,200,21
4,https://assets.atlasobscura.com/media/W1siZiIs...,A former Roman town that was once the home of ...,200,12
...,...,...,...,...
19470,https://assets.atlasobscura.com/media/W1siZiIs...,A varied collection of brains in alcohol.,200,7
19471,https://assets.atlasobscura.com/media/W1siZiIs...,Dental instruments from the illustrious histor...,200,9
19472,https://assets.atlasobscura.com/media/W1siZiIs...,An ancient rock engraving shows Aboriginals us...,200,17
19473,https://assets.atlasobscura.com/media/W1siZiIs...,A vision of beautiful Hindu architecture in th...,200,13


In [73]:
# The image files for these rows were not found, so drop the rows as well.
df.drop(index = [168, 2904, 14242], inplace = True)
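
The three labels above are hard-coded. One way they could be derived instead is sketched below, assuming each image is stored as `<row label>.jpg` (the naming used in the download cell). The helper name `find_missing_images` is hypothetical.

```python
import os


def find_missing_images(labels, image_dir, extension=".jpg"):
    """Return the labels that have no image file in `image_dir`."""
    missing = []
    for label in labels:
        path = os.path.join(image_dir, f"{label}{extension}")
        if not os.path.isfile(path):
            missing.append(label)
    return missing
```

For example, `find_missing_images(df.index, "../Data/Atlas_Images/")` returns a list that can be passed to `df.drop(index=...)`, so the captions and the image folder stay in sync if the download is repeated.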

In [74]:
# Keep the index: the row labels match the image filenames (e.g. 0.jpg).
df.to_csv("../Data/atlas_edits_clean.csv", index = True)