## 1.1. Get the list of places
We start with the list of places to include in your corpus of documents. In particular, we focus on the Most popular places. Next, we want you to collect the URL associated with each place in this list. The list is long and split into many pages, so we ask you to retrieve only the URLs of the places listed in the first 400 pages (each page has 18 places, so you will end up with 7200 unique place URLs).

The output of this step is a .txt file in which each line contains the URL of one place.

In [2]:
import requests
import bs4
import os
from datetime import datetime
import pandas as pd
import numpy as np

Since the first page has a different URL, we save it in the variable urlP.\
The other 399 pages share a common URL in which only the page number differs, so we split that URL into two parts;\
every time we want to access a different page, we simply concatenate the two parts with the page number in the middle.
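
As an illustration of the concatenation described above, here is a minimal sketch. The helper name `page_url` and the constant names are ours and are not part of the notebook; the URL pieces are the ones used in this section.

```python
# URL pieces for the "Most popular places" list (same values as in this section).
FIRST_PAGE_URL = "https://www.atlasobscura.com/places?sort=likes_count"
URL_PREFIX = "https://www.atlasobscura.com/places?page="
URL_SUFFIX = "&sort=likes_count"


def page_url(page: int) -> str:
    """Return the URL of the given page (1-based) of the list."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    # The first page has no 'page' parameter; later pages put it in the middle.
    if page == 1:
        return FIRST_PAGE_URL
    return f"{URL_PREFIX}{page}{URL_SUFFIX}"


print(page_url(1))
print(page_url(2))
```

Keeping the special case for the first page inside one function means the calling loop does not need its own `if`/`else` branch.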

In [None]:
urlP = "https://www.atlasobscura.com/places?sort=likes_count"
url1="https://www.atlasobscura.com/places?page="
url2="&sort=likes_count"

In [None]:
def extract_single_link(url1,url2):
    # make sure the output folder exists before opening the file
    os.makedirs("Places", exist_ok=True)
    with open("Places/Address.txt","w") as f:
        for i in range(1,401):
            # the first page has its own URL; the others are built from the page number
            url = urlP if i==1 else url1+str(i)+url2
            result = requests.get(url)
            soup = bs4.BeautifulSoup(result.text,"lxml")
            # each place card is an <a> tag whose href is the relative path of the place
            for item in soup.find_all('a',{'class':'content-card content-card-place'}):
                f.write("https://www.atlasobscura.com"+item["href"]+"\n")



In [None]:
extract_single_link(url1,url2)

## 1.2. Crawl places
Once you get all the URLs in the first 400 pages of the list, you:

1. Download the HTML corresponding to each of the collected URLs.
2. After you collect a single page, immediately save its HTML in a file. In this way, if your program stops for any reason, you will not lose the data collected up to the stopping point.
3. Organize the entire set of downloaded HTML pages into folders. Each folder will contain the HTML of the places on page 1, page 2, ... of the list of locations.

Tip: Due to the large number of pages you should download, you can use some methods that can help you shorten the time it takes. If you employed a particular process or approach, kindly describe it.
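
One possible way to shorten the download time is a thread pool, since most of the time is spent waiting on the network. The sketch below is an assumption about how this could be done, not the approach used in this notebook: the names `save_page` and `download_all` are hypothetical, and the `fetch` argument is any function that takes a URL and returns its HTML as a string. Each page is written to disk as soon as it is fetched, so an interrupted run keeps what it already collected.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def save_page(url, path, fetch):
    """Fetch one URL and write its HTML to `path` right away; return True on success."""
    if os.path.exists(path):
        # Already saved by a previous run: nothing to do.
        return True
    try:
        html = fetch(url)
    except Exception as exc:
        print(f"Failed {url}: {exc}")
        return False
    # Create the target folder on demand, then save the page immediately.
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as out:
        out.write(html)
    return True


def download_all(jobs, fetch, workers=8):
    """Run save_page over (url, path) pairs with a pool of threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps the results in the same order as `jobs`.
        return list(pool.map(lambda job: save_page(job[0], job[1], fetch), jobs))
```

With the `requests` library, `fetch` could be something like `lambda url: requests.get(url, timeout=30).text`. The number of workers should stay modest so that the site is not flooded with requests.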

In [None]:
with open('Places/Address.txt', 'r') as f:
    urls_places = f.readlines()

for index, url in enumerate(urls_places):                  # index keeps track of the ranking of the place
    url = url.strip()                                       # drop the trailing newline read from the file
    page_number = index//18                                 # used to put the HTML file in the right folder
    # url[36:] is the name of the place, which starts at position 36 in every URL;
    # ranks 1 to 9 get a leading zero so that their files stay ordered
    filename = f'Page_{page_number+1}/{index+1:02d}_{url[36:].replace("-"," ")}.html'
    os.makedirs(os.path.dirname(filename), exist_ok=True)   # create the folder if it does not exist yet
    # skip pages that were already downloaded (files under 2000 bytes are treated as incomplete)
    if os.path.exists(filename) and os.path.getsize(filename) >= 2000:
        print("done")
        continue
    try:
        r = requests.get(url)
        if r.status_code==200:
            # open the file only after a successful request, so a failure does not leave an empty file
            with open(filename, 'w',encoding="utf-8") as file:
                file.write(r.text)
            print("Downloaded place "+str(index+1)+", Page "+str(page_number+1))
    except Exception as e:
        print("Error occurred! "+ str(e))
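
The mapping from a position in the URL list to a folder and a file name can be sketched in isolation as follows. This is a variation on the cell above, not a copy of it: the helper names `page_and_rank` and `html_path` are hypothetical, the slug is taken from the last path segment instead of a fixed offset, and the rank is padded to four digits (an assumption of ours) so that alphabetical order matches the ranking for all 7200 places.

```python
PLACES_PER_PAGE = 18  # as stated in section 1.1


def page_and_rank(index):
    """Map a 0-based position in the URL list to (page number, rank), both 1-based."""
    return index // PLACES_PER_PAGE + 1, index + 1


def html_path(index, url):
    """Build the relative path of the HTML file for the place at `index`."""
    # The last path segment of the URL is the slug of the place.
    slug = url.strip().rstrip("/").rsplit("/", 1)[-1]
    page, rank = page_and_rank(index)
    # Four-digit padding: 0001 ... 7200 sort the same way as the numbers do.
    return f"Page_{page}/{rank:04d}_{slug.replace('-', ' ')}.html"


# "some-place" is a made-up slug used only for illustration.
print(html_path(18, "https://www.atlasobscura.com/places/some-place\n"))
```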

## 1.3 Parse downloaded pages
At this point, you should have all the HTML documents about the places of interest, and you can start to extract the places' information. The list of the information we desire for each place and their format is as follows:

__Place Name__ (to save as __placeName__): String.\
__Place Tags__ (to save as __placeTags__): List of Strings.\
\# of people who have been there (to save as __numPeopleVisited__): Integer.\
\# of people who want to visit the place (to save as __numPeopleWant__): Integer.\
Description (to save as __placeDesc__): String. Everything from under the first image up to "know before you go" (orange frame on the example image).\
Short Description (to save as __placeShortDesc__): String. Everything from the title and location up to the image (blue frame on the example image).\
Nearby Places (to save as __placeNearby__): Extract the names of all nearby places, but only keep unique values: List of Strings.\
Address of the place (to save as __placeAddress__): String.\
Latitude and Longitude of the place's location (to save as __placeAlt__ and __placeLong__): Integers.\
The username of the post editors (to save as __placeEditors__): List of Strings.\
Post publishing date (to save as __placePubDate__): datetime.\
The names of the lists that the place was included in (to save as __placeRelatedLists__): List of Strings.\
The names of the related places (to save as __placeRelatedPlaces__): List of Strings.\
The URL of the page of the place (to save as __placeURL__): String.

In [12]:
# the first 7 lines iterate over all the folders, of which there are 400 (hence range(1,401))
with open('Places/Address.txt', 'r') as f:
    urls_places = f.readlines()

for i in range(1,401):
    page = str(i)
    directory = f'Page_{page}'
    # iterate over the files in that directory
    for filename in os.listdir(directory):
        with open(f"{directory}/{filename}", 'r',encoding="utf-8") as html_file:
            soup = bs4.BeautifulSoup(html_file.read(), 'lxml')
        # os.listdir returns the files in arbitrary order, so the ranking is read
        # from the numeric prefix of the file name rather than from the loop position
        indice = int(filename.split("_")[0])
        urlPlace = urls_places[indice-1].strip()
        data = extract_single_place(soup)
        data['placeURL']=urlPlace
        file_name = f"place_{indice}.tsv"
        create_csv(data,file_name)
            


In [2]:
def extract_single_place(soup):
    placeName = str(soup.find_all('h1',{'class': 'DDPage__header-title'})[0].text)
    placeTags = [x.text.replace("\n","") for x in soup.find_all('a',{'class': 'itemTags__link js-item-tags-link'})]
    numPeopleVisited = int(soup.find_all('div',{'class': 'title-md item-action-count'})[0].text)
    numPeopleWant = int(soup.find_all('div',{'class': 'title-md item-action-count'})[1].text)
    placeDesc = "".join([x.text.replace("\xa0","") for x in soup.find_all('div',{'class':'DDP__body-copy'})])
    placeShortDesc = str(soup.find_all('h3',{'class': 'DDPage__header-dek'})[0].contents[0])
    placeNearby = [x.text for x in soup.find_all('div',{'class':'DDPageSiderailRecirc__item-title'})]
    placeAddress = find_address(soup)
    placeAlt,placeLong = AltLong(soup)
    placeEditors = [x.text.replace("\n","") for x in soup.find_all("a",{'class':'DDPContributorsList__contributor'})]
    placePubDate = tempo(soup)
    placeRelatedLists,placeRelatedPlaces = RelatedPlace(soup)
    
    return {'placeName':placeName,
            'placeTags':placeTags,
            'numPeopleVisited':numPeopleVisited,
            'numPeopleWant':numPeopleWant,
            'placeDesc':placeDesc,
            'placeShortDesc':placeShortDesc,
            'placeNearby':placeNearby,
            'placeAddress':placeAddress,
            'placeAlt':placeAlt,
            'placeLong':placeLong,
            'placeEditors':placeEditors,
            'placePubDate':placePubDate,
            'placeRelatedLists':placeRelatedLists,
            'placeRelatedPlaces':placeRelatedPlaces}


In [3]:
def tempo(soup):
    dates = soup.find_all('div',{'class':'DDPContributor__name'})
    if len(dates)>0:
        # the text is expected in the form "<month name> <day>, <year>"
        placePubDate = datetime.strptime(dates[0].text.replace(",","").strip(),'%B %d %Y')
    else:
        # no date on the page: use a missing-value marker
        placePubDate = np.datetime64("NaT")
    return placePubDate
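
The date parsing can be tried on its own, without any HTML. The sketch below assumes the date text has the form "month name, day, year"; the function name `parse_pub_date` and the sample string are hypothetical, and it returns `None` for text that does not match, which is a choice of ours rather than the notebook's.

```python
from datetime import datetime
from typing import Optional


def parse_pub_date(text: str) -> Optional[datetime]:
    """Parse text such as 'October 12, 2016' into a datetime, or return None."""
    # Remove the comma and surrounding whitespace before matching the format.
    cleaned = text.replace(",", "").strip()
    try:
        # %B is the full month name (locale-dependent; English names assumed here).
        return datetime.strptime(cleaned, "%B %d %Y")
    except ValueError:
        return None


print(parse_pub_date("October 12, 2016"))
print(parse_pub_date("not a date"))
```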

In [4]:
def create_csv(data,file_name):
    df = pd.DataFrame.from_dict(data, orient='index')
    df = df.transpose()
    df.to_csv(f'{file_name}',index=False,sep="\t")

In [5]:
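import re

def RelatedPlace(soup):
    RelaPlace = soup.find_all('h3',{'class':'Card__heading --content-card-v2-title js-title-content'})
    app=0
    for a in soup.find_all('div',{'class':'CardRecircSection__title'}):
        # the "Appears in ..." section title gives the number of related lists;
        # the regex reads the whole number instead of the single character at position 11
        match = re.search(r'Appears in (\d+)', a.text)
        if match:
            app = int(match.group(1))
    # the last `app` cards are the lists and the 4 cards before them are the related places;
    # an explicit end index avoids the -0 slice, which would select every card when app is 0
    end = max(len(RelaPlace)-app, 0)
    placeRelatedLists = [place.span.text for place in RelaPlace[end:]]
    placeRelatedPlaces = [place.span.text for place in RelaPlace[max(end-4,0):end]]
    return placeRelatedLists,placeRelatedPlaces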
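
When the tail of a list is sliced by a count that may be zero, it helps to remember that `cards[-0:]` is the whole list, not an empty one. The sketch below shows one way to split a list of card titles into "lists" and "places" with an explicit end index; the function name `split_cards` and the sample titles are made up for illustration.

```python
def split_cards(cards, n_lists, n_places=4):
    """Return (lists, places): the last n_lists items and the n_places items before them."""
    # Compute the boundary explicitly so that n_lists == 0 gives an empty tail.
    end = max(len(cards) - n_lists, 0)
    start = max(end - n_places, 0)
    return cards[end:], cards[start:end]


cards = ["place A", "place B", "place C", "place D", "list X", "list Y"]
lists, places = split_cards(cards, n_lists=2)
print(lists)
print(places)

# With no lists on the page, the tail is empty and the places are still found.
print(split_cards(cards[:4], n_lists=0))
```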

In [11]:
def AltLong(soup):
    # the coordinates are the third child of the element, as a comma-separated pair
    coordinates = soup.find_all('div',{'class' : 'DDPageSiderail__coordinates js-copy-coordinates'})[0].contents[2].replace("\n","").replace(" ","").split(",")
    placeAlt = float(coordinates[0])
    placeLong = float(coordinates[1])
    return placeAlt,placeLong
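
The string handling in `AltLong` can be checked without a downloaded page. The sketch below assumes the coordinates arrive as a single "first value, second value" string surrounded by whitespace; the function name `parse_coordinates` and the sample numbers are hypothetical.

```python
def parse_coordinates(text):
    """Turn a string such as ' 41.8902, 12.4922 ' into a pair of floats."""
    # Split on the comma and trim the whitespace around each part.
    parts = [part.strip() for part in text.split(",")]
    if len(parts) != 2:
        raise ValueError(f"expected two comma-separated values, got {text!r}")
    return float(parts[0]), float(parts[1])


print(parse_coordinates("\n 41.8902, 12.4922\n"))
```

Raising an error on malformed input makes a page with an unexpected layout easy to spot, instead of silently storing a wrong value.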

In [10]:
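def find_address(soup):
    address = soup.find_all('address',{'class':'DDPageSiderail__address'})
    placeAddress =""
    # keep only the plain-text children of the address block, skipping nested tags
    for add in address[0].div.contents:
        if type(add)==bs4.element.NavigableString:
            placeAddress = placeAddress+" "+add
    # remove the leading space and any surrounding newlines
    return placeAddress.strip()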
        