# Data module class 2
Reading documentation: Pandas and BeautifulSoup

In [21]:
import pandas as pd
from bs4 import BeautifulSoup
import requests

In [86]:
# install BeautifulSoup first if you need to
# !pip install beautifulsoup4

## Pandas
### Terminology reference
#### Data structures
##### 1-dimensional data (create Series)

|pandas abbreviation|definition|example|
|---|---|---|
|dict|Python dictionary|`{'a': 'value', 'b': 'value'}`|
|ndarray|NumPy N-dimensional array (1-dimensional here)|`np.array([0, 1, 2, 3])`|
|scalar|Single value|`100`|
|list|Python list|`[0, 1, 2, 3]`|

##### 2-dimensional data (create DataFrames)

|pandas term|example|
|---|---|
|ndarray (or nested list)|`[[0, 1, 2, 3], [4, 5, 6, 7]]`|
|dict of ndarrays (or lists)|`{'one': [1, 2, 3, 4], 'two': [4, 3, 2, 1]}`|
|list of dicts|`[{'id': 1, 'info': 'text'}, {'id': 2, 'info': 'more text'}]`|

#### How do these look when loaded in pandas?
[Taken from the Pandas User Guide](https://pandas.pydata.org/docs/user_guide/dsintro.html)

In [55]:
pd.Series({'a': 'value', 'b': 'value'})

a    value
b    value
dtype: object

In [56]:
pd.Series([0, 1, 2, 3])

0    0
1    1
2    2
3    3
dtype: int64

In [57]:
pd.Series(5)

0    5
dtype: int64

In [80]:
pd.DataFrame([{'id': 1, 'info': 'text'}, {'id': 2, 'info': 'more text'}])

Unnamed: 0,id,info
0,1,text
1,2,more text


In [81]:
pd.DataFrame([[0, 1, 2, 3], [4, 5, 6, 7]])

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
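The "dict of ndarrays" row from the table above is not demonstrated in the cells, so here is a minimal sketch of it. The variable names (`data`, `df_from_dict`) are made up for illustration.

```python
import numpy as np
import pandas as pd

# each dict key becomes a column label; each array becomes that column's values
data = {'one': np.array([1, 2, 3, 4]), 'two': np.array([4, 3, 2, 1])}
df_from_dict = pd.DataFrame(data)

print(df_from_dict.shape)          # (number of rows, number of columns)
print(list(df_from_dict.columns))  # labels taken from the dict keys
```

All arrays in the dict need the same length, since each one fills a full column.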


#### Other terms
[See pd.to_datetime() as an example](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html#pandas.to_datetime)

- parameters
    - The inputs that a function accepts, as listed in its documentation
- args
    - Positional arguments, typically the required ones (the things the function needs in order to run)
    - e.g. `arg`, the data you want to convert to datetimes
- kwargs (even though the pandas documentation does not label them as such)
    - Keyword arguments: optional arguments that a function does not need in order to run, but that tell it to behave differently from the default. They are called "keyword" arguments because you pass them by parameter name
    - e.g. `errors='raise'`
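To make the difference between a positional argument and a keyword argument concrete, here is a small sketch using `pd.to_datetime()`. The sample list `dates` is invented for illustration.

```python
import pandas as pd

dates = ['2021-06-29', 'not a date']

# positional argument only: the default errors='raise' raises on the bad value
try:
    pd.to_datetime(dates)
    raised = False
except ValueError:
    raised = True

# keyword argument: errors='coerce' turns unparseable values into NaT instead
converted = pd.to_datetime(dates, errors='coerce')
print(raised)
print(converted)
```

The data is the same in both calls; only the keyword argument changes how the function reacts to the bad value.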

### 1. Let's practice input/output with pandas using the following links.
Use the [IO Tools](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) section of the pandas documentation to load these datasets

- [Avengers Wikia data - FiveThirtyEight](https://raw.githubusercontent.com/fivethirtyeight/data/master/comic-characters/marvel-wikia-data.csv) | [Documentation here](https://github.com/fivethirtyeight/data/tree/master/comic-characters)
- [List of sovereign states - Wikipedia](https://en.wikipedia.org/wiki/List_of_sovereign_states)
- [Homeless housing - LA Times](https://raw.githubusercontent.com/kyleykim/R_Scripts/master/la-me-ln-hhh-unequal/revised_data/master_data_geocoded.csv) | [Documentation](https://github.com/kyleykim/R_Scripts/tree/master/la-me-ln-hhh-unequal)

In [87]:
df_avengers = pd.read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/comic-characters/marvel-wikia-data.csv')

In [88]:
df_countries = pd.read_html('https://en.wikipedia.org/wiki/List_of_sovereign_states')

In [107]:
df_homeless_housing = pd.read_csv('https://raw.githubusercontent.com/kyleykim/R_Scripts/master/la-me-ln-hhh-unequal/revised_data/master_data_geocoded.csv')
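`pd.read_csv()` also accepts a file-like object, which is handy for trying things out without a network connection. The sketch below reads a made-up three-column CSV from a string; the contents of `csv_text` are invented and only loosely mimic the Avengers columns.

```python
import io
import pandas as pd

# a tiny, invented CSV; the empty HAIR field in the second row becomes NaN
csv_text = "name,HAIR,Year\nA,Blond Hair,1966\nB,,1985\n"
df_sample = pd.read_csv(io.StringIO(csv_text))

print(df_sample.shape)
print(df_sample['HAIR'].isna().sum())  # number of missing HAIR values
```

Note that `pd.read_html()` behaves differently from `pd.read_csv()`: it returns a list of DataFrames, one per table found on the page, which is why `df_countries` is indexed with `[0]` later on.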

### 2. Let's practice working with missing data and selecting these values
#### For each DataFrame, either select all the missing values of one column or select a unique categorical value.
The [Indexing and selecting data](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html) section of the pandas documentation will help.
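As a warm-up, here is a minimal sketch of both kinds of selection on a toy DataFrame. `df_toy` and its values are made up; the column name `ID` just mirrors the Avengers data.

```python
import numpy as np
import pandas as pd

df_toy = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'ID': ['Secret Identity', np.nan, 'Public Identity'],
})

# count the missing values in every column
missing_per_column = df_toy.isna().sum()

# boolean mask: keep only the rows where ID is missing
missing_id_rows = df_toy[df_toy['ID'].isna()]

# select one categorical value, and only the name column, with .loc
secret_names = df_toy.loc[df_toy['ID'] == 'Secret Identity', 'name']
```

The pattern is the same in both cases: build a Series of True/False values, then pass it inside the square brackets.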

#### a. Avengers

In [90]:
df_avengers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16376 entries, 0 to 16375
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   page_id           16376 non-null  int64  
 1   name              16376 non-null  object 
 2   urlslug           16376 non-null  object 
 3   ID                12606 non-null  object 
 4   ALIGN             13564 non-null  object 
 5   EYE               6609 non-null   object 
 6   HAIR              12112 non-null  object 
 7   SEX               15522 non-null  object 
 8   GSM               90 non-null     object 
 9   ALIVE             16373 non-null  object 
 10  APPEARANCES       15280 non-null  float64
 11  FIRST APPEARANCE  15561 non-null  object 
 12  Year              15561 non-null  float64
dtypes: float64(2), int64(1), object(10)
memory usage: 1.6+ MB


In [91]:
# list the unique values in a column to see whether it contains missing values (nan)
df_avengers['ID'].unique()

array(['Secret Identity', 'Public Identity', 'No Dual Identity',
       'Known to Authorities Identity', nan], dtype=object)

In [92]:
df_avengers[df_avengers['ID'].isna()]

Unnamed: 0,page_id,name,urlslug,ID,ALIGN,EYE,HAIR,SEX,GSM,ALIVE,APPEARANCES,FIRST APPEARANCE,Year
467,2028,Arthur Parks (Earth-616),\/Arthur_Parks_(Earth-616),,Bad Characters,Variable Eyes,Variable Hair,Male Characters,,Living Characters,88.0,Nov-66,1966.0
536,65598,Kathryn Cushing (Earth-616),\/Kathryn_Cushing_(Earth-616),,,Blue Eyes,Blond Hair,Female Characters,,Living Characters,72.0,Nov-85,1985.0
573,2159,Calvin Rankin (Earth-616),\/Calvin_Rankin_(Earth-616),,Good Characters,Brown Eyes,Brown Hair,Male Characters,,Living Characters,67.0,Apr-66,1966.0
577,2526,Shadow King (Earth-616),\/Shadow_King_(Earth-616),,Bad Characters,Red Eyes,No Hair,Male Characters,,Living Characters,67.0,Jan-79,1979.0
605,16087,Arthur Stacy (Earth-616),\/Arthur_Stacy_(Earth-616),,Good Characters,Blue Eyes,Grey Hair,Male Characters,,Living Characters,63.0,Feb-71,1971.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
16353,16296,William Burke (Earth-616),\/William_Burke_(Earth-616),,,,,Male Characters,,Living Characters,,,
16354,120833,William Falsworth (Earth-616),\/William_Falsworth_(Earth-616),,Good Characters,,,Male Characters,,Deceased Characters,,,
16365,684262,K'thol (Earth-616),\/K%27thol_(Earth-616),,Good Characters,,,Male Characters,,Deceased Characters,,,
16370,674414,Phoenix's Shadow (Earth-616),\/Phoenix%27s_Shadow_(Earth-616),,Neutral Characters,,,,,Living Characters,,,


In [95]:
df_avengers['ALIVE'].unique()

array(['Living Characters', 'Deceased Characters', nan], dtype=object)

In [97]:
df_avengers[df_avengers['ALIVE'].isna()]

Unnamed: 0,page_id,name,urlslug,ID,ALIGN,EYE,HAIR,SEX,GSM,ALIVE,APPEARANCES,FIRST APPEARANCE,Year
16293,541449,Mj7711,\/User:Mj7711,,,,,,,,,,
16329,714409,Sharjeel786,\/User:Sharjeel786,,,,,,,,,,
16347,462671,TOR\/test,\/User:TOR\/test,,,,,,,,,,


#### b. countries

In [99]:
# read_html returns a list of DataFrames (two here); look at the first one
df_countries[0]

Unnamed: 0,Common and formal names,Membership within the UN System[a],Sovereignty dispute[b],Further information on status and recognition of sovereignty[d]
0,,,,
1,UN member states and observer states ↓,,,
2,Abkhazia → See Abkhazia listing,Abkhazia → See Abkhazia listing,Abkhazia → See Abkhazia listing,Abkhazia → See Abkhazia listing
3,Afghanistan – Islamic Republic of Afghanistan,UN member state,,
4,Albania – Republic of Albania,,,
...,...,...,...,...
237,South Ossetia – Republic of South Ossetia–the ...,,Georgia,"A de facto independent state,[70] recognised b..."
238,Taiwan – Republic of China[l],Former UN member and former permanent UN Secur...,People's Republic of China,A state competing (nominally) for recognition ...
239,Transnistria – Pridnestrovian Moldavian Republic,,Moldova,"A de facto independent state,[56] recognised o..."
240,,,,


In [101]:
df_countries[0].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 242 entries, 0 to 241
Data columns (total 4 columns):
 #   Column                                                           Non-Null Count  Dtype 
---  ------                                                           --------------  ----- 
 0   Common and formal names                                          237 non-null    object
 1   Membership within the UN System[a]                               36 non-null     object
 2   Sovereignty dispute[b]                                           47 non-null     object
 3   Further information on status and recognition of sovereignty[d]  134 non-null    object
dtypes: object(4)
memory usage: 7.7+ KB


In [102]:
df_countries[0][df_countries[0]['Common and formal names'].isna()]

Unnamed: 0,Common and formal names,Membership within the UN System[a],Sovereignty dispute[b],Further information on status and recognition of sovereignty[d]
0,,,,
227,,,,
228,,,,
240,,,,
241,,,,


#### c. LA homeless housing

In [109]:
df_homeless_housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79 entries, 0 to 78
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   project_name  79 non-null     object 
 1   address       79 non-null     object 
 2   district_no   79 non-null     int64  
 3   units         79 non-null     int64  
 4   sh_units      79 non-null     int64  
 5   status        79 non-null     object 
 6   lon           79 non-null     float64
 7   lat           79 non-null     float64
 8   geoAddress    79 non-null     object 
dtypes: float64(2), int64(3), object(4)
memory usage: 5.7+ KB


In [110]:
df_homeless_housing['status'].unique()

array(['Already approved', 'Pending City Council approval'], dtype=object)

In [111]:
df_homeless_housing[df_homeless_housing['status'] == 'Pending City Council approval']

Unnamed: 0,project_name,address,district_no,units,sh_units,status,lon,lat,geoAddress
55,La Veranda,2420 E CESAR E CHAVEZ AVE 90033,14,77,38,Pending City Council approval,-118.207147,34.046205,"2420 east cesar e chavez avenue, los angeles, ..."
56,Asante Apartments,11001 S BROADWAY 90061,8,55,54,Pending City Council approval,-118.278654,33.935378,"11001 s broadway, los angeles, ca 90061, usa"
57,Weingart Tower 1B HHH PSH,554 S SAN PEDRO ST 90013,14,104,83,Pending City Council approval,-118.244668,34.042607,"554 san pedro st, los angeles, ca 90013, usa"
58,803 E. 5th Street,803 E 5TH ST 90013,14,95,94,Pending City Council approval,-118.240607,34.042475,"803 e 5th st, los angeles, ca 90013, usa"
59,Colorado East,2453 W COLORADO BLVD 90041,14,41,40,Pending City Council approval,-118.220304,34.141138,"2453 colorado blvd, los angeles, ca 90041, usa"
60,Watts Works,9502 S COMPTON AVE 90002,15,26,25,Pending City Council approval,-118.245833,33.950543,"9502 compton ave, los angeles, ca 90002, usa"
61,Los Lirios Apartments,119 S SOTO ST 90033,14,64,20,Pending City Council approval,-118.210496,34.0434,"119 s soto st, los angeles, ca 90033, usa"
62,Enlightenment Plaza - Phase I,316 N JUANITA AVE 90004,13,101,100,Pending City Council approval,-118.290141,34.076862,"316 n juanita ave, los angeles, ca 90004, usa"
63,Normandie 84,8401 S NORMANDIE AVE 90044,8,42,34,Pending City Council approval,-118.300467,33.962534,"8401 normandie ave, los angeles, ca 90044, usa"
64,11408 S. Central Avenue,11408 S CENTRAL AVE 90059,15,64,63,Pending City Council approval,-118.253916,33.93066,"11408 s central ave, los angeles, ca 90059, usa"


### 3. Let's practice cleaning with intent

#### Use each of the three datasets loaded above to come up with a question you want to answer with the data
##### Tips
- Show the column list, the column types, and the null counts
- Find unique values to look at categorical data

#### a. Avengers
##### Question
- How many Avengers have blonde hair? 

##### What cleaning do I need to do to answer the question
- Look up what options of hair color are available (categories)
- Identify different spellings
- Check count

In [42]:
# show the dataframe info here to get you started 
df_avengers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16376 entries, 0 to 16375
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   page_id           16376 non-null  int64  
 1   name              16376 non-null  object 
 2   urlslug           16376 non-null  object 
 3   ID                12606 non-null  object 
 4   ALIGN             13564 non-null  object 
 5   EYE               6609 non-null   object 
 6   HAIR              12112 non-null  object 
 7   SEX               15522 non-null  object 
 8   GSM               90 non-null     object 
 9   ALIVE             16373 non-null  object 
 10  APPEARANCES       15280 non-null  float64
 11  FIRST APPEARANCE  15561 non-null  object 
 12  Year              15561 non-null  float64
dtypes: float64(2), int64(1), object(10)
memory usage: 1.6+ MB


In [46]:
# Look up what options of hair color are available (categories)
pd.DataFrame(df_avengers['HAIR'].unique())

Unnamed: 0,0
0,Brown Hair
1,White Hair
2,Black Hair
3,Blond Hair
4,No Hair
5,Blue Hair
6,Red Hair
7,Bald
8,Auburn Hair
9,Grey Hair


In [47]:
# ID different spellings
# Let's create a broader category for hair
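One possible way to build a broader category is to match on a substring instead of an exact label. This is only a sketch on an invented Series; the labels in `hair` (including the lowercase and "Strawberry" variants) are assumptions for illustration, not values confirmed from the dataset.

```python
import numpy as np
import pandas as pd

# invented sample of hair labels with spelling variants and a missing value
hair = pd.Series(['Blond Hair', 'Strawberry Blond Hair', 'Brown Hair',
                  np.nan, 'blond hair'])

# case-insensitive substring match; na=False treats missing values as non-matches
is_blond = hair.str.contains('blond', case=False, na=False)
blond_count = int(is_blond.sum())
print(blond_count)
```

Whether variants like these should count as "blond" is a judgment call that depends on the question you are asking.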

In [48]:
# Check count
len(df_avengers[df_avengers['HAIR'] == 'Blond Hair'])

1582
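Instead of counting one category at a time, `value_counts()` tallies every category in one call. A minimal sketch on an invented Series (`hair_sample` is a made-up name and its values are illustrative):

```python
import numpy as np
import pandas as pd

hair_sample = pd.Series(['Blond Hair', 'Brown Hair', 'Blond Hair', np.nan])

# dropna=False keeps the missing values as their own row in the tally
counts = hair_sample.value_counts(dropna=False)
blond_rows = int(counts['Blond Hair'])
print(counts)
```

Seeing all the counts side by side also makes it easier to spot near-duplicate spellings.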

#### b. Countries
##### Question
- _your question here_

##### What cleaning do I need to do to answer the question
- 
- 
- 

In [106]:
df_countries[0].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 242 entries, 0 to 241
Data columns (total 4 columns):
 #   Column                                                           Non-Null Count  Dtype 
---  ------                                                           --------------  ----- 
 0   Common and formal names                                          237 non-null    object
 1   Membership within the UN System[a]                               36 non-null     object
 2   Sovereignty dispute[b]                                           47 non-null     object
 3   Further information on status and recognition of sovereignty[d]  134 non-null    object
dtypes: object(4)
memory usage: 7.7+ KB


#### c. LA homeless housing
##### Question
- _your question here_

##### What cleaning do I need to do to answer the question
- 
- 
- 

In [108]:
df_homeless_housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79 entries, 0 to 78
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   project_name  79 non-null     object 
 1   address       79 non-null     object 
 2   district_no   79 non-null     int64  
 3   units         79 non-null     int64  
 4   sh_units      79 non-null     int64  
 5   status        79 non-null     object 
 6   lon           79 non-null     float64
 7   lat           79 non-null     float64
 8   geoAddress    79 non-null     object 
dtypes: float64(2), int64(3), object(4)
memory usage: 5.7+ KB


Take a look at the [LA Times'](https://github.com/datadesk/notebooks) or [FiveThirtyEight's](https://github.com/fivethirtyeight/data) GitHub repositories for more practice

## BeautifulSoup
[BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

In [72]:
# set the URL of the page whose HTML we will download and parse with BeautifulSoup
sp_wiki_url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'

In [23]:
sp_page_r = requests.get(sp_wiki_url)

In [25]:
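# name the parser explicitly so the result does not depend on which parsers are installed
sp_bs = BeautifulSoup(sp_page_r.content, 'html.parser')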

In [71]:
# finds the title tag
sp_bs.title

<title>List of S&amp;P 500 companies - Wikipedia</title>

In [42]:
# grabs the first a tag
sp_bs.a

<a id="top"></a>

In [112]:
# finds all a tags
len(sp_bs.find_all('a'))

3562

In [73]:
# find all elements with the class "mw-jump-link"
sp_bs.find_all(class_='mw-jump-link')

[<a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#searchInput">Jump to search</a>]
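The same lookups can be tried offline on a small HTML string. The markup in `html` below is invented for illustration; only the BeautifulSoup calls mirror the cells above.

```python
from bs4 import BeautifulSoup

# a tiny, made-up page with one classed link and one plain link
html = """
<html><head><title>Demo page</title></head>
<body>
  <a class="mw-jump-link" href="#nav">Jump to navigation</a>
  <a href="/wiki/Example">Example</a>
</body></html>
"""
soup = BeautifulSoup(html, 'html.parser')

title_text = soup.title.string                      # text inside the title tag
all_links = soup.find_all('a')                      # every a tag, in document order
jump_links = soup.find_all(class_='mw-jump-link')   # filter by CSS class
hrefs = [a['href'] for a in all_links]              # read an attribute like a dict key
```

`class_` has a trailing underscore because `class` is a reserved word in Python.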

### Traverse the DOM

In [74]:
# we know the table we want is the first table in the DOM
# then we want to read tr tags as rows and td tags as the cells within each row

In [47]:
# find where the data you want resides (a tag, class name, etc.)
sp_tables = sp_bs.find_all('table')
# the first table on the page is the one we want
sp_table = sp_tables[0]

In [48]:
# find_all tr
# len(sp_table.find_all('tr'))
sp_trs = sp_table.find_all('tr')

In [75]:
# separate the first tr tag row for the header
sp_th = sp_trs[0].find_all('th')
sp_header = []
for th in sp_th:
    sp_header.append(th.text)

In [51]:
# for each tr, find its tds, get the text inside each td, and save the row to a new list
sp_list = []
for tr in sp_trs[1:]:
    tds = tr.find_all('td')
    tr_list = []
    for i, td in enumerate(tds):
        # for the third column (index 2), get the href link instead of the text
        if i == 2:
            tr_list.append(td.find('a')['href'])
        else:
            tr_list.append(td.text)
    sp_list.append(tr_list)
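The loop above assumes every third cell contains a link. Here is a more defensive sketch of the same traversal on an invented table, where one row has no link; the HTML, the column names, and the fallback to the cell text are all assumptions for illustration.

```python
from bs4 import BeautifulSoup

# made-up table: the second data row has no a tag in its third cell
html = """
<table>
  <tr><th>Symbol</th><th>Security</th><th>Filings</th></tr>
  <tr><td>AAA</td><td>Example Co</td><td><a href="/filings/aaa">reports</a></td></tr>
  <tr><td>BBB</td><td>Sample Inc</td><td>none</td></tr>
</table>
"""
table = BeautifulSoup(html, 'html.parser').find('table')
rows = table.find_all('tr')

# header cells live in th tags; strip=True trims surrounding whitespace
header = [th.get_text(strip=True) for th in rows[0].find_all('th')]

records = []
for tr in rows[1:]:
    record = []
    for i, td in enumerate(tr.find_all('td')):
        link = td.find('a')
        # third column: use the href when a link exists, otherwise keep the text
        if i == 2 and link is not None and link.has_attr('href'):
            record.append(link['href'])
        else:
            record.append(td.get_text(strip=True))
    records.append(record)
```

Without the check, `td.find('a')` returns `None` for the second row and indexing it with `['href']` raises a `TypeError`.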

In [77]:
sp_df = pd.DataFrame(sp_list, columns=sp_header)

In [78]:
sp_df.to_csv('formatted_data/2021-06-29_sp500.csv', index=False)
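`to_csv()` does not create missing folders, so the call above fails if `formatted_data/` does not exist yet. A minimal sketch of creating the folder first, using a temporary directory and a made-up file name (`example.csv`) so it can run anywhere:

```python
import tempfile
from pathlib import Path

import pandas as pd

# a temporary base folder stands in for your project directory
base_dir = Path(tempfile.mkdtemp())
out_dir = base_dir / 'formatted_data'

# create the folder (and any parents); do nothing if it already exists
out_dir.mkdir(parents=True, exist_ok=True)

out_path = out_dir / 'example.csv'
pd.DataFrame({'Symbol': ['AAA']}).to_csv(out_path, index=False)

# read the file back to confirm it was written
round_trip = pd.read_csv(out_path)
```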

#### We can do more cleaning here
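For example, text pulled out of table cells with `.text` often carries trailing newlines. The sketch below strips them from the column labels and the values of an invented one-row DataFrame; `df_raw` and its contents are assumptions standing in for `sp_df`.

```python
import pandas as pd

# made-up scraped data with trailing newlines in labels and values
df_raw = pd.DataFrame([['MMM\n', '3M\n']], columns=['Symbol\n', 'Security\n'])

df_clean = df_raw.copy()
# strip whitespace from the column labels
df_clean.columns = df_clean.columns.str.strip()
# strip whitespace from every (text) column
for col in df_clean.columns:
    df_clean[col] = df_clean[col].str.strip()
```

The `.str` accessor only works on text columns, so for a mixed DataFrame you would limit the loop to the columns that hold strings.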