# South America Sites


Project Plan:

1. Data Acquisition
2. Data Cleaning

# Data Acquisition

In [149]:
import pandas as pd # library for data analysis
import requests # library to handle requests
from bs4 import BeautifulSoup # library to parse HTML documents

In [150]:
from io import StringIO # wraps the HTML string for pd.read_html

wikiurl="https://en.wikipedia.org/wiki/List_of_World_Heritage_Sites_in_South_America"


def get_data(wikiurl):

    # request the page and print the HTTP status code (200 means success)
    response=requests.get(wikiurl)
    print(response.status_code)
    
    # parse the html into a BeautifulSoup object and take the first wikitable
    soup = BeautifulSoup(response.text, 'html.parser')
    table=soup.find('table',{'class':"wikitable"})

    # read_html returns a list of DataFrames; the first one is the table
    tables=pd.read_html(StringIO(str(table)))
    return tables[0]

df = get_data(wikiurl)

200
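
The lookup `soup.find('table', {'class': 'wikitable'})` also matches tables that carry several classes, because BeautifulSoup treats `class` as a multi-valued attribute. A minimal sketch on a made-up HTML snippet (the snippet and the `first_wikitable` helper are illustrative assumptions, not part of the notebook):

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; the real page is much larger.
SAMPLE_HTML = """
<table class="infobox"><tr><td>not this one</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Site</th><th>Year</th></tr>
  <tr><td>Brasília</td><td>1987</td></tr>
</table>
"""

def first_wikitable(html):
    """Return the first <table> whose class list contains 'wikitable'."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("table", {"class": "wikitable"})

table = first_wikitable(SAMPLE_HTML)

# Collect the text of every cell, row by row.
rows = [[cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")]

print(table["class"])
print(rows)
```

The first table is skipped because its class list does not contain `wikitable`.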


In [151]:
df.head()

| | Site | Image | Location | Criteria | Areaha (acre) | Year | Description | Refs |
|---|---|---|---|---|---|---|---|---|
| 0 | Atlantic Forest South-East Reserves | | Paraná, São Paulo and Rio de Janeiro states, B... | Natural:(vii), (ix), (x) | 468,193 (1,156,930); buffer zone 1,223,557 (3,... | 1999 | The site comprises some of the last remaining ... | [3] |
| 1 | Brasília | | Federal District, Brazil15°47′S 47°54′W / 15... | Cultural:(i), (iv) | | 1987 | Planned and developed by Lúcio Costa and Oscar... | [4] |
| 2 | Brazilian Atlantic Islands: Fernando de Noronh... | | Pernambuco and Rio Grande do Norte, Brazil3°51... | Natural:(vii), (ix), (x) | 42,270 (104,500); buffer zone 140,713 (347,710) | 2001 | As one of the few insular habitats in the Sout... | [5] |
| 3 | Canaima National Park | | Bolívar, Venezuela5°20′N 61°30′W / 5.333°N 6... | Natural:(vii), (viii), (ix), (x) | 3,000,000 (7,400,000) | 1994 | The park is characterized by table-top mountai... | [6] |
| 4 | Central Amazon Conservation Complex | | State of Amazonas, Brazil2°20′0″S 62°0′30″W /... | Natural:(ix), (x) | 5,323,018 (13,153,460) | 2000[nb 1] | As the largest protected area in the Amazon ba... | [7][8] |


# Data Cleaning

Data cleaning Plan:

1. Delete the Image column
2. From Location create two columns: the place and the coordinates
3. Remove Roman numerals from Criteria
4. Keep only the first number from Areaha (acre): the area in hectares, which is the value outside the parentheses
5. Extract the year from Year
6. Check whether Description contains any information about elevation (ft)
7. Remove brackets from Refs

In [152]:
# drops unnecessary columns
df.drop('Image', axis = 1, inplace = True)

From location create two columns: Location and Coordinates

In [153]:
def get_location(df):

    # keep the text before the first digit, i.e. the place name
    # that precedes the coordinates
    df['Location_place'] = df['Location'].str.extract(r'^(\D*)', expand=False)
    return df

df = get_location(df)
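
The idea of cutting a Location value at its first digit can be tried on a single string with the standard `re` module. This is only an illustration: the `sample` string is taken from the table preview, and `split_location` is a hypothetical helper that does not exist in the notebook.

```python
import re

def split_location(text):
    """Split a Location value into (place, coordinates) at the first digit."""
    # The capturing group keeps the digit runs in the result list.
    parts = re.split(r"(\d+)", text)
    place = parts[0]                  # text before the first digit
    coordinates = "".join(parts[1:])  # everything from the first digit on
    return place, coordinates

sample = "Federal District, Brazil15°47′S 47°54′W"
parts = re.split(r"(\d+)", sample)
place, coordinates = split_location(sample)

print(parts[:3])
print(place)
print(coordinates)
```

The first list element is the place name, and joining the remaining elements restores the coordinate text unchanged.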

In [154]:
def get_coordinates(df):

    # drop the leading place name (everything before the first digit)
    # so that only the coordinates are left
    df['Location_coordinates'] = df['Location'].str.replace(r'^\D*', '', regex=True)
    return df

df = get_coordinates(df)
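
If numeric coordinates are needed later, the decimal part after the slash could be converted to signed floats. The sketch below is an optional extension, not a step of the notebook: `to_decimal` and `DECIMAL_RE` are hypothetical names, and the `"15.783°S 47.900°W"` layout is assumed from the table preview.

```python
import re

# Assumed decimal layout: "<lat>°N|S <lon>°E|W", e.g. "15.783°S 47.900°W".
DECIMAL_RE = re.compile(r"(\d+(?:\.\d+)?)°([NS])\s+(\d+(?:\.\d+)?)°([EW])")

def to_decimal(coordinates):
    """Return (latitude, longitude) as signed floats, or None if not found."""
    # Use the text after the last "/", where the decimal form is shown.
    decimal_part = coordinates.split("/")[-1]
    match = DECIMAL_RE.search(decimal_part)
    if match is None:
        return None
    lat, ns, lon, ew = match.groups()
    latitude = float(lat) * (-1 if ns == "S" else 1)   # south is negative
    longitude = float(lon) * (-1 if ew == "W" else 1)  # west is negative
    return latitude, longitude

result = to_decimal("15°47′S 47°54′W / 15.783°S 47.900°W")
print(result)
```

Values without a decimal part return `None`, so they can be inspected separately instead of being converted silently.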

Remove Roman numerals from Criteria

Keep only the first number (the area in hectares) from Areaha (acre)

Extract the year from Year

Check whether Description contains any information about elevation (ft)

Remove brackets from Refs

In [155]:
def further_transformations(df):
    
    # keep only the leading word (Natural, Cultural, ...) and drop the Roman numerals
    df['Criteria'] = df['Criteria'].str.extract(r'([a-zA-Z]+)', expand=False)
    
    # remove the thousands separators
    df['Areaha (acre)'] = df['Areaha (acre)'].str.replace(',','', regex=False)
    
    # keep the first number, which is the area in hectares
    df['Areaha (acre)'] = df['Areaha (acre)'].str.extract(r'(\d+)', expand=False)
    
    # keep the year and drop trailing notes such as "[nb 1]"
    df['Year'] = df['Year'].str.extract(r'(\d+)', expand=False)
    
    # keep the first reference number without brackets
    # (for "[7][8]" only 7 is kept)
    df['Refs'] = df['Refs'].str.extract(r'(\d+)', expand=False)
    return df

df = further_transformations(df)
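
The patterns used in this step can be checked on single values with the standard `re` module, since `Series.str.extract` returns the first match in the same way `re.search` does. A small sketch, where `first_match` is a hypothetical helper and the inputs are copied from the table preview:

```python
import re

def first_match(pattern, text):
    """Return the first captured group of pattern in text, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None

# Leading word of Criteria; the Roman numerals are dropped.
criteria = first_match(r"([a-zA-Z]+)", "Natural:(vii), (ix), (x)")

# Commas are removed first, then the first digit run (hectares) is kept.
area = first_match(r"(\d+)", "468,193 (1,156,930); buffer zone".replace(",", ""))

# Trailing note "[nb 1]" is ignored.
year = first_match(r"(\d+)", "2000[nb 1]")

# Only the first reference number survives.
refs = first_match(r"(\d+)", "[7][8]")

print(criteria, area, year, refs)
```

Because only the first match is kept, a cell such as `[7][8]` loses its second reference number.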

In [158]:
def data_types_conversion(df,col):
    
    # convert to a numeric type; values that cannot be parsed become NaN
    df[col] = pd.to_numeric(df[col], errors = 'coerce')
    
    # replace NaN with 0 so that the column can be cast to int
    # (a missing value therefore shows up as 0 in the result)
    df[col] = df[col].fillna(0).astype(int)
    return df


# apply data_types_conversion to list of columns
columns_list = ['Areaha (acre)', 'Year','Refs']
for col in columns_list:
    df = data_types_conversion(df,col)
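
Filling missing values with 0 makes a site with no recorded area look like a site with an area of 0. A possible alternative, shown here only as a sketch on made-up values, is the pandas nullable `Int64` dtype, which keeps the column integer typed while preserving missing entries:

```python
import pandas as pd

# Made-up values in the shape left by the regex step: digit strings or missing.
area = pd.Series(["468193", None, "42270"])

# Missing or unparseable entries become NaN instead of raising an error.
numeric = pd.to_numeric(area, errors="coerce")

filled = numeric.fillna(0).astype(int)  # missing -> 0, as in the notebook
nullable = numeric.astype("Int64")      # missing -> <NA>, still integer typed

print(filled.tolist())
print(nullable.isna().tolist())
```

Which variant fits better depends on whether later analysis needs to tell "unknown" apart from zero.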

In [162]:
def dropping_columns(df):
    
    # drop the columns that are no longer needed
    columns = ['Location', 'Description']
    df.drop(columns, axis = 1, inplace = True)
    return df

df = dropping_columns(df)

# Cleaned DataFrame

In [165]:
df.head()

| | Site | Criteria | Areaha (acre) | Year | Refs | Location_place | Location_coordinates |
|---|---|---|---|---|---|---|---|
| 0 | Atlantic Forest South-East Reserves | Natural | 468193 | 1999 | 3 | Paraná, São Paulo and Rio de Janeiro states, B... | 24°10′S 48°0′W / 24.167°S 48.000°W |
| 1 | Brasília | Cultural | 0 | 1987 | 4 | Federal District, Brazil | 15°47′S 47°54′W / 15.783°S 47.900°W |
| 2 | Brazilian Atlantic Islands: Fernando de Noronh... | Natural | 42270 | 2001 | 5 | Pernambuco and Rio Grande do Norte, Brazil | 3°51′29″S 32°25′30″W / 3.85806°S 32.42500°W |
| 3 | Canaima National Park | Natural | 3000000 | 1994 | 6 | Bolívar, Venezuela | 5°20′N 61°30′W / 5.333°N 61.500°W |
| 4 | Central Amazon Conservation Complex | Natural | 5323018 | 2000 | 7 | State of Amazonas, Brazil | 2°20′0″S 62°0′30″W / 2.33333°S 62.00833°W |
