# Data module class 2
Reading documentation: Pandas and BeautifulSoup

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import requests

In [2]:
# install BeautifulSoup and the lxml parser if you need to
# !pip install beautifulsoup4
# !pip install lxml

## Pandas
### Terminology reference
#### Data structures
##### 1-dimensional data (create Series)

|pandas abbreviation|definition|example|
|---|---|---|
|dict|Python dictionary|`{'a': 'value', 'b': 'value'}`|
|ndarray|NumPy N-dimensional array (can be 1 or 2 dimensional)|`np.array([0, 1, 2, 3])`|
|scalar|Single value|`100`|
|list|Python list|`[0, 1, 2, 3]`|

##### 2-dimensional data (create DataFrames)

|pandas term|example|
|---|---|
|ndarray|`np.array([[0, 1, 2, 3], [4, 5, 6, 7]])`|
|dict of ndarrays|`{'one': [1, 2, 3, 4], 'two': [4, 3, 2, 1]}`|
|list of dicts|`[{'id': 1, 'info': 'text'}, {'id': 2, 'info': 'more text'}]`|

#### How do these look when loaded in pandas?
[Taken from the Pandas User Guide](https://pandas.pydata.org/docs/user_guide/dsintro.html)

In [3]:
pd.Series({'a': 'value', 'b': 'value'})

a    value
b    value
dtype: object

In [4]:
pd.Series([0, 1, 2, 3])

0    0
1    1
2    2
3    3
dtype: int64

In [5]:
pd.Series(5)

0    5
dtype: int64

In [6]:
pd.DataFrame([{'id': 1, 'info': 'text'}, {'id': 2, 'info': 'more text'}])

Unnamed: 0,id,info
0,1,text
1,2,more text
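
The tables above also list ndarrays and dicts of arrays, which the cells here do not show. A minimal sketch, assuming NumPy is installed (it is a pandas dependency); the variable names are illustrative only:

```python
import numpy as np
import pandas as pd

# 1-dimensional ndarray -> Series
s = pd.Series(np.array([0, 1, 2, 3]))

# 2-dimensional ndarray -> DataFrame (each inner list becomes a row)
df_from_array = pd.DataFrame(np.array([[0, 1, 2, 3], [4, 5, 6, 7]]))

# dict of equal-length lists -> DataFrame (each key becomes a column)
df_from_dict = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [4, 3, 2, 1]})

print(df_from_array.shape)  # (2, 4)
print(df_from_dict.shape)   # (4, 2)
```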


#### Other terms
[See pd.to_datetime() as an example](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html#pandas.to_datetime)

#### parameters: information that a function accepts
- args
    - Positional arguments, usually the required inputs (the things the function needs in order to run)
    - e.g. the data for your DataFrame
- kwargs (the pandas documentation lists them under "Parameters" without labeling them as such)
    - Keyword arguments: usually optional arguments that are not needed for the function to run, but tell it to behave differently from the default. They are called "keyword" arguments because you pass them by name
    - e.g. errors='raise'
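
To make the distinction concrete, here is a small sketch with `pd.to_datetime()`. The input list is made up for illustration; `errors='coerce'` is one of the documented options for the `errors` keyword.

```python
import pandas as pd

dates = ['2021-06-30', 'not a date']

# `dates` is passed positionally; `errors` is passed by keyword.
# errors='coerce' turns unparseable values into NaT instead of raising
# (the default is errors='raise').
parsed = pd.to_datetime(dates, errors='coerce')
print(parsed)
```

Leaving out `errors='coerce'` would raise an error on the second value, which is the default behavior.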

### 1. Let's practice input/output with Pandas with the following links.
Use the [IO Tools](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) section of the pandas documentation to load these datasets.

- [Avengers Wikia data - FiveThirtyEight](https://raw.githubusercontent.com/fivethirtyeight/data/master/comic-characters/marvel-wikia-data.csv) | [Documentation here](https://github.com/fivethirtyeight/data/tree/master/comic-characters)
- [List of sovereign states - Wikipedia](https://en.wikipedia.org/wiki/List_of_sovereign_states)
- [Homeless housing - LA Times](https://raw.githubusercontent.com/kyleykim/R_Scripts/master/la-me-ln-hhh-unequal/revised_data/master_data_geocoded.csv) | [Documentation](https://github.com/kyleykim/R_Scripts/tree/master/la-me-ln-hhh-unequal)

In [7]:
df_avengers = pd.read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/comic-characters/marvel-wikia-data.csv")

In [8]:
# read_html returns a list of DataFrames, one per table found on the page
df_un = pd.read_html("https://en.wikipedia.org/wiki/List_of_sovereign_states", header = [0])

In [9]:
df_homeless = pd.read_csv("https://raw.githubusercontent.com/kyleykim/R_Scripts/master/la-me-ln-hhh-unequal/revised_data/master_data_geocoded.csv")
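
`read_csv` accepts a URL, a local path, or a file-like object, so the same call can be tried offline. A small sketch using made-up CSV text (the rows are invented for illustration and are not taken from the datasets above):

```python
import io
import pandas as pd

csv_text = """name,ALIGN,Year
Hero A,Good Characters,1962
Hero B,,1941
"""

# An empty field is read as NaN by default.
df_demo = pd.read_csv(io.StringIO(csv_text))
print(df_demo)
print(df_demo.isna().sum())
```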

### 2. Let's practice working with missing data and selecting these values
#### For each DataFrame, either select all the missing values of one column or select a unique categorical value.
The [Indexing and selecting data](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html) section of the pandas documentation will help.
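
Both kinds of selection rely on a boolean mask: a Series of True/False values that decides which rows to keep. A minimal sketch on an invented three-row DataFrame:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'status': ['approved', 'pending', np.nan],
})

# isna() returns a boolean Series; using it inside [] keeps the True rows.
missing = df_demo[df_demo['status'].isna()]

# Comparing a column to a value works the same way.
pending = df_demo[df_demo['status'] == 'pending']

# .loc takes the same mask and can pick columns at the same time.
pending_names = df_demo.loc[df_demo['status'] == 'pending', 'name']
```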

#### a. Avengers

In [10]:
df_avengers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16376 entries, 0 to 16375
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   page_id           16376 non-null  int64  
 1   name              16376 non-null  object 
 2   urlslug           16376 non-null  object 
 3   ID                12606 non-null  object 
 4   ALIGN             13564 non-null  object 
 5   EYE               6609 non-null   object 
 6   HAIR              12112 non-null  object 
 7   SEX               15522 non-null  object 
 8   GSM               90 non-null     object 
 9   ALIVE             16373 non-null  object 
 10  APPEARANCES       15280 non-null  float64
 11  FIRST APPEARANCE  15561 non-null  object 
 12  Year              15561 non-null  float64
dtypes: float64(2), int64(1), object(10)
memory usage: 1.6+ MB


In [11]:
df_avengers.head(2)

Unnamed: 0,page_id,name,urlslug,ID,ALIGN,EYE,HAIR,SEX,GSM,ALIVE,APPEARANCES,FIRST APPEARANCE,Year
0,1678,Spider-Man (Peter Parker),\/Spider-Man_(Peter_Parker),Secret Identity,Good Characters,Hazel Eyes,Brown Hair,Male Characters,,Living Characters,4043.0,Aug-62,1962.0
1,7139,Captain America (Steven Rogers),\/Captain_America_(Steven_Rogers),Public Identity,Good Characters,Blue Eyes,White Hair,Male Characters,,Living Characters,3360.0,Mar-41,1941.0


In [12]:
df_avengers.ALIGN.unique()

array(['Good Characters', 'Neutral Characters', 'Bad Characters', nan],
      dtype=object)

In [13]:
df_avengers.ALIGN.value_counts(dropna = False)

Bad Characters        6720
Good Characters       4636
NaN                   2812
Neutral Characters    2208
Name: ALIGN, dtype: int64

In [14]:
df_avengers[df_avengers.ALIGN.isna()]

Unnamed: 0,page_id,name,urlslug,ID,ALIGN,EYE,HAIR,SEX,GSM,ALIVE,APPEARANCES,FIRST APPEARANCE,Year
80,67048,Blaine Colt (Earth-616),\/Blaine_Colt_(Earth-616),Public Identity,,Blue Eyes,Blond Hair,Male Characters,,Deceased Characters,429.0,,
118,100209,Millicent Collins (Earth-616),\/Millicent_Collins_(Earth-616),Public Identity,,Blue Eyes,Blond Hair,Female Characters,,Living Characters,321.0,Dec-45,1945.0
135,100716,Chili Storm (Earth-616),\/Chili_Storm_(Earth-616),Public Identity,,Green Eyes,Red Hair,Female Characters,Homosexual Characters,Living Characters,284.0,Oct-48,1948.0
204,18854,Gloria Grant (Earth-616),\/Gloria_Grant_(Earth-616),No Dual Identity,,Brown Eyes,Black Hair,Female Characters,,Living Characters,202.0,Jan-75,1975.0
244,15257,Redwing (Earth-616),\/Redwing_(Earth-616),Public Identity,,Yellow Eyes,,Male Characters,,Living Characters,165.0,Sep-69,1969.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
16347,462671,TOR\/test,\/User:TOR\/test,,,,,,,,,,
16348,725312,Toxin (Luminals) (Earth-616),\/Toxin_(Luminals)_(Earth-616),Secret Identity,,,No Hair,,,Living Characters,,,
16350,40061,Valka (Earth-616),\/Valka_(Earth-616),,,,,,,Living Characters,,,
16352,16569,Viridian (Earth-616),\/Viridian_(Earth-616),,,,,Male Characters,,Living Characters,,,


In [15]:
df_avengers.Year.value_counts(dropna = False)

NaN       815
1993.0    554
1994.0    485
1992.0    455
2006.0    381
         ... 
1952.0     26
1956.0     16
1957.0      7
1959.0      4
1958.0      2
Name: Year, Length: 76, dtype: int64

In [16]:
df_avengers[df_avengers.Year.isna()]

Unnamed: 0,page_id,name,urlslug,ID,ALIGN,EYE,HAIR,SEX,GSM,ALIVE,APPEARANCES,FIRST APPEARANCE,Year
12,7823,Namor McKenzie (Earth-616),\/Namor_McKenzie_(Earth-616),No Dual Identity,Neutral Characters,Green Eyes,Black Hair,Male Characters,,Living Characters,1528.0,,
38,1677,Rogue (Anna Marie) (Earth-616),\/Rogue_(Anna_Marie)_(Earth-616),Secret Identity,Good Characters,Green Eyes,Auburn Hair,Female Characters,,Living Characters,850.0,,
80,67048,Blaine Colt (Earth-616),\/Blaine_Colt_(Earth-616),Public Identity,,Blue Eyes,Blond Hair,Male Characters,,Deceased Characters,429.0,,
114,37751,Monica Rambeau (Earth-616),\/Monica_Rambeau_(Earth-616),Secret Identity,Good Characters,Brown Eyes,Black Hair,Female Characters,,Living Characters,327.0,,
259,25255,James Bradley (Earth-616),\/James_Bradley_(Earth-616),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,158.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
16371,657508,Ru'ach (Earth-616),\/Ru%27ach_(Earth-616),No Dual Identity,Bad Characters,Green Eyes,No Hair,Male Characters,,Living Characters,,,
16372,665474,Thane (Thanos' son) (Earth-616),\/Thane_(Thanos%27_son)_(Earth-616),No Dual Identity,Good Characters,Blue Eyes,Bald,Male Characters,,Living Characters,,,
16373,695217,Tinkerer (Skrull) (Earth-616),\/Tinkerer_(Skrull)_(Earth-616),Secret Identity,Bad Characters,Black Eyes,Bald,Male Characters,,Living Characters,,,
16374,708811,TK421 (Spiderling) (Earth-616),\/TK421_(Spiderling)_(Earth-616),Secret Identity,Neutral Characters,,,Male Characters,,Living Characters,,,


#### b. Countries

In [17]:
df_un[0]

Unnamed: 0,Common and formal names,Membership within the UN System[a],Sovereignty dispute[b],Further information on status and recognition of sovereignty[d]
0,,,,
1,UN member states and observer states ↓,,,
2,Abkhazia → See Abkhazia listing,Abkhazia → See Abkhazia listing,Abkhazia → See Abkhazia listing,Abkhazia → See Abkhazia listing
3,Afghanistan – Islamic Republic of Afghanistan,UN member state,,
4,Albania – Republic of Albania,,,
...,...,...,...,...
237,South Ossetia – Republic of South Ossetia–the ...,,Georgia,"A de facto independent state,[70] recognised b..."
238,Taiwan – Republic of China[l],Former UN member and former permanent UN Secur...,People's Republic of China,A state competing (nominally) for recognition ...
239,Transnistria – Pridnestrovian Moldavian Republic,,Moldova,"A de facto independent state,[56] recognised o..."
240,,,,


In [18]:
df_un[0][df_un[0]['Common and formal names'].isna()]

Unnamed: 0,Common and formal names,Membership within the UN System[a],Sovereignty dispute[b],Further information on status and recognition of sovereignty[d]
0,,,,
227,,,,
228,,,,
240,,,,
241,,,,


In [19]:
df_un[0][df_un[0]['Membership within the UN System[a]'].isna()]

Unnamed: 0,Common and formal names,Membership within the UN System[a],Sovereignty dispute[b],Further information on status and recognition of sovereignty[d]
0,,,,
1,UN member states and observer states ↓,,,
4,Albania – Republic of Albania,,,
5,Algeria – People's Democratic Republic of Algeria,,,
6,Andorra – Principality of Andorra,,,Andorra is a co-principality in which the offi...
...,...,...,...,...
236,Somaliland – Republic of Somaliland,,Somalia,"A de facto independent state,[56][65][66][67][..."
237,South Ossetia – Republic of South Ossetia–the ...,,Georgia,"A de facto independent state,[70] recognised b..."
239,Transnistria – Pridnestrovian Moldavian Republic,,Moldova,"A de facto independent state,[56] recognised o..."
240,,,,


#### c. LA homeless housing

In [20]:
df_homeless.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79 entries, 0 to 78
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   project_name  79 non-null     object 
 1   address       79 non-null     object 
 2   district_no   79 non-null     int64  
 3   units         79 non-null     int64  
 4   sh_units      79 non-null     int64  
 5   status        79 non-null     object 
 6   lon           79 non-null     float64
 7   lat           79 non-null     float64
 8   geoAddress    79 non-null     object 
dtypes: float64(2), int64(3), object(4)
memory usage: 5.7+ KB


In [21]:
df_homeless.head()

Unnamed: 0,project_name,address,district_no,units,sh_units,status,lon,lat,geoAddress
0,Reseda Theater Senior Housing (Canby Woods West),7221 N CANBY AVE CA 91335,3,26,13,Already approved,-118.535105,34.201798,"7221 canby ave, reseda, ca 91335, usa"
1,Main Street Apartments,5501 S MAIN ST CA 90037,9,57,56,Already approved,-118.274276,33.992203,"5501 s main st, los angeles, ca 90037, usa"
2,Berendo Sage,1035 S BERENDO ST CA 90006,1,42,21,Already approved,-118.294014,34.051678,"1035 s berendo st, los angeles, ca 90006, usa"
3,South Main Street Apartments,12003 S MAIN ST CA 90061,15,62,61,Already approved,-118.27425,33.923439,"12003 s main st, los angeles, ca 90061, usa"
4,Montecito II Senior Housing,6668 W FRANKLIN AVE HOLLYWOOD CA 90028,13,64,32,Already approved,-118.335282,34.105027,"6668 franklin ave, los angeles, ca 90028, usa"


In [22]:
df_homeless.status.unique()

array(['Already approved', 'Pending City Council approval'], dtype=object)

In [23]:
df_homeless.status.value_counts()

Already approved                 55
Pending City Council approval    24
Name: status, dtype: int64

In [24]:
df_homeless[df_homeless.status == 'Pending City Council approval']

Unnamed: 0,project_name,address,district_no,units,sh_units,status,lon,lat,geoAddress
55,La Veranda,2420 E CESAR E CHAVEZ AVE 90033,14,77,38,Pending City Council approval,-118.207147,34.046205,"2420 east cesar e chavez avenue, los angeles, ..."
56,Asante Apartments,11001 S BROADWAY 90061,8,55,54,Pending City Council approval,-118.278654,33.935378,"11001 s broadway, los angeles, ca 90061, usa"
57,Weingart Tower 1B HHH PSH,554 S SAN PEDRO ST 90013,14,104,83,Pending City Council approval,-118.244668,34.042607,"554 san pedro st, los angeles, ca 90013, usa"
58,803 E. 5th Street,803 E 5TH ST 90013,14,95,94,Pending City Council approval,-118.240607,34.042475,"803 e 5th st, los angeles, ca 90013, usa"
59,Colorado East,2453 W COLORADO BLVD 90041,14,41,40,Pending City Council approval,-118.220304,34.141138,"2453 colorado blvd, los angeles, ca 90041, usa"
60,Watts Works,9502 S COMPTON AVE 90002,15,26,25,Pending City Council approval,-118.245833,33.950543,"9502 compton ave, los angeles, ca 90002, usa"
61,Los Lirios Apartments,119 S SOTO ST 90033,14,64,20,Pending City Council approval,-118.210496,34.0434,"119 s soto st, los angeles, ca 90033, usa"
62,Enlightenment Plaza - Phase I,316 N JUANITA AVE 90004,13,101,100,Pending City Council approval,-118.290141,34.076862,"316 n juanita ave, los angeles, ca 90004, usa"
63,Normandie 84,8401 S NORMANDIE AVE 90044,8,42,34,Pending City Council approval,-118.300467,33.962534,"8401 normandie ave, los angeles, ca 90044, usa"
64,11408 S. Central Avenue,11408 S CENTRAL AVE 90059,15,64,63,Pending City Council approval,-118.253916,33.93066,"11408 s central ave, los angeles, ca 90059, usa"


### 3. Let's practice cleaning with intent

#### Use each of the three datasets loaded above to generate a question you want to answer with the data
##### Tips
- Show the column list, the column types, and the null values
- Find unique values to look at categorical data

#### a. Avengers
##### Question
- What proportion of female avengers are aligned to good, compared with other sexes?

##### What steps do I need to do to answer the question?
- Look up what options for sex and alignment are available (categories)
- Drop null values if necessary
- Group alignment by sex
- Count values and normalize
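
One way to sketch the last two steps is `pd.crosstab` with `normalize='index'`, which gives the share of each alignment within each sex. The tiny DataFrame below is invented for illustration; it only borrows the column names from the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    'SEX': ['Female Characters', 'Female Characters', 'Male Characters',
            'Male Characters', 'Male Characters', 'Male Characters'],
    'ALIGN': ['Good Characters', 'Bad Characters', 'Good Characters',
              'Bad Characters', 'Bad Characters', 'Bad Characters'],
})

# Rows are SEX, columns are ALIGN; normalize='index' makes each row sum to 1.
shares = pd.crosstab(toy['SEX'], toy['ALIGN'], normalize='index')
print(shares)
```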

In [25]:
# show the dataframe info here to get you started 
df_avengers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16376 entries, 0 to 16375
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   page_id           16376 non-null  int64  
 1   name              16376 non-null  object 
 2   urlslug           16376 non-null  object 
 3   ID                12606 non-null  object 
 4   ALIGN             13564 non-null  object 
 5   EYE               6609 non-null   object 
 6   HAIR              12112 non-null  object 
 7   SEX               15522 non-null  object 
 8   GSM               90 non-null     object 
 9   ALIVE             16373 non-null  object 
 10  APPEARANCES       15280 non-null  float64
 11  FIRST APPEARANCE  15561 non-null  object 
 12  Year              15561 non-null  float64
dtypes: float64(2), int64(1), object(10)
memory usage: 1.6+ MB


In [26]:
df_avengers.SEX.value_counts(dropna = False)

Male Characters           11638
Female Characters          3837
NaN                         854
Agender Characters           45
Genderfluid Characters        2
Name: SEX, dtype: int64

In [27]:
df_avengers = df_avengers.dropna(subset=['SEX'])

In [28]:
df_avengers.ALIGN.value_counts(dropna = False, normalize = True)
## 17% of character alignment is unknown - it might be important to know that alignment is a mystery

Bad Characters        0.408066
Good Characters       0.290813
NaN                   0.166216
Neutral Characters    0.134905
Name: ALIGN, dtype: float64

In [29]:
(df_avengers.groupby(by = 'SEX').ALIGN.value_counts(dropna = False, normalize = True)*100).to_frame(name = '%').sort_values(by = 'SEX')

Unnamed: 0_level_0,Unnamed: 1_level_0,%
SEX,ALIGN,Unnamed: 2_level_1
Agender Characters,Bad Characters,44.444444
Agender Characters,Neutral Characters,28.888889
Agender Characters,Good Characters,22.222222
Agender Characters,,4.444444
Female Characters,Good Characters,40.057336
Female Characters,Bad Characters,25.436539
Female Characters,,17.826427
Female Characters,Neutral Characters,16.679698
Genderfluid Characters,Good Characters,50.0
Genderfluid Characters,Neutral Characters,50.0


#### b. Countries
##### Question
- Which countries have a sovereignty dispute?

##### What cleaning do I need to do to answer the question?
- Select the sovereignty dispute column
- Filter for all non-null values

In [30]:
df_un[0][df_un[0]['Sovereignty dispute[b]'].notna()].reset_index()

Unnamed: 0,index,Common and formal names,Membership within the UN System[a],Sovereignty dispute[b],Further information on status and recognition of sovereignty[d]
0,2,Abkhazia → See Abkhazia listing,Abkhazia → See Abkhazia listing,Abkhazia → See Abkhazia listing,Abkhazia → See Abkhazia listing
1,10,Armenia – Republic of Armenia,,Not recognised by Pakistan.,Armenia is not recognised by Pakistan due to t...
2,11,Artsakh → See Artsakh listing,Artsakh → See Artsakh listing,Artsakh → See Artsakh listing,Artsakh → See Artsakh listing
3,31,Burma → See Myanmar listing,Burma → See Myanmar listing,Burma → See Myanmar listing,Burma → See Myanmar listing
4,40,China – People's Republic of China[l],,Partially unrecognised. Republic of China,"China contains five autonomous regions, Guangx..."
5,41,"China, Republic of → See Taiwan listing","China, Republic of → See Taiwan listing","China, Republic of → See Taiwan listing","China, Republic of → See Taiwan listing"
6,46,Cook Islands → See Cook Islands listing,Cook Islands → See Cook Islands listing,Cook Islands → See Cook Islands listing,Cook Islands → See Cook Islands listing
7,48,Côte d'Ivoire → See Ivory Coast listing,Côte d'Ivoire → See Ivory Coast listing,Côte d'Ivoire → See Ivory Coast listing,Côte d'Ivoire → See Ivory Coast listing
8,51,Cyprus – Republic of Cyprus,,Not recognised by Turkey[13],Member of the EU.[c] The northeastern part of ...
9,53,Democratic People's Republic of Korea → See Ko...,Democratic People's Republic of Korea → See Ko...,Democratic People's Republic of Korea → See Ko...,Democratic People's Republic of Korea → See Ko...
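
The result still includes redirect rows such as "Abkhazia → See Abkhazia listing", where every cell repeats the same text. One possible way to drop them, shown on a few rows copied by hand from the outputs above, with shortened, hypothetical column names (`name`, `dispute`):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'name': ['Abkhazia → See Abkhazia listing',
             'Armenia – Republic of Armenia',
             'Albania – Republic of Albania'],
    'dispute': ['Abkhazia → See Abkhazia listing',
                'Not recognised by Pakistan.',
                np.nan],
})

has_dispute = toy['dispute'].notna()
# regex=False treats the pattern as plain text
is_redirect = toy['name'].str.contains('→ See', regex=False)

# & combines masks element-wise; ~ negates a mask
disputed = toy[has_dispute & ~is_redirect]
print(disputed['name'].tolist())
```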


#### c. LA homeless housing
##### Question
- What zip codes have the most approved homeless housing projects?

##### What cleaning do I need to do to answer the question?
- Separate zip codes from the addresses into a new column
- Filter dataset for approved projects
- Count values

In [31]:
df_homeless.head(2)

Unnamed: 0,project_name,address,district_no,units,sh_units,status,lon,lat,geoAddress
0,Reseda Theater Senior Housing (Canby Woods West),7221 N CANBY AVE CA 91335,3,26,13,Already approved,-118.535105,34.201798,"7221 canby ave, reseda, ca 91335, usa"
1,Main Street Apartments,5501 S MAIN ST CA 90037,9,57,56,Already approved,-118.274276,33.992203,"5501 s main st, los angeles, ca 90037, usa"


In [32]:
df_homeless['zip'] = df_homeless['address'].str[-5:]

In [33]:
df_homeless.head(2)

Unnamed: 0,project_name,address,district_no,units,sh_units,status,lon,lat,geoAddress,zip
0,Reseda Theater Senior Housing (Canby Woods West),7221 N CANBY AVE CA 91335,3,26,13,Already approved,-118.535105,34.201798,"7221 canby ave, reseda, ca 91335, usa",91335
1,Main Street Apartments,5501 S MAIN ST CA 90037,9,57,56,Already approved,-118.274276,33.992203,"5501 s main st, los angeles, ca 90037, usa",90037
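
Slicing the last five characters works here because every address ends in a five-digit zip code. If that assumption might not hold (trailing spaces, ZIP+4 codes), a regular expression is a more defensive option. A sketch; the second and third addresses are made up:

```python
import pandas as pd

addresses = pd.Series([
    '7221 N CANBY AVE CA 91335',
    '100 EXAMPLE ST CA 90012 ',       # trailing space (made up)
    '200 EXAMPLE AVE CA 90013-1234',  # ZIP+4 (made up)
])

# Capture five digits at the end, allowing an optional -dddd suffix and whitespace.
zips = addresses.str.extract(r'(\d{5})(?:-\d{4})?\s*$', expand=False)
print(zips.tolist())
```

Rows that do not match the pattern come back as NaN, which makes bad addresses easy to find with `isna()`.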


In [34]:
df_homeless[df_homeless.status == 'Already approved'].value_counts(subset = 'zip')

zip
90003    3
90017    3
90057    3
90038    3
90018    3
90037    3
90014    3
90007    2
91342    2
90073    2
90061    2
90006    2
90013    2
90044    2
90028    2
90026    2
90031    1
90020    1
91402    1
91352    1
90019    1
91335    1
90744    1
90291    1
90062    1
90032    1
90025    1
90047    1
90043    1
90029    1
90004    1
91606    1
dtype: int64

Take a look at the [LA Times'](https://github.com/datadesk/notebooks) or [FiveThirtyEight's](https://github.com/fivethirtyeight/data) GitHub repositories for more practice.

## BeautifulSoup
[BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

In [35]:
# load in the HTML and format for BS
sp_wiki_url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'

In [36]:
sp_page_r = requests.get(sp_wiki_url)

In [37]:
# name the parser explicitly (lxml was installed above) to avoid a warning
sp = BeautifulSoup(sp_page_r.content, 'lxml')

In [38]:
# find the title tag
sp.title

<title>List of S&amp;P 500 companies - Wikipedia</title>

In [39]:
# grab the first a tag
sp.a

<a id="top"></a>

In [40]:
# finds all a tags
sp.find_all('a')

[<a id="top"></a>,
 <a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#searchInput">Jump to search</a>,
 <a href="/wiki/S%26P_500" title="S&amp;P 500">S&amp;P 500</a>,
 <a href="/wiki/Stock_market_index" title="Stock market index">stock market index</a>,
 <a href="/wiki/S%26P_Dow_Jones_Indices" title="S&amp;P Dow Jones Indices">S&amp;P Dow Jones Indices</a>,
 <a href="/wiki/Common_stock" title="Common stock">common stocks</a>,
 <a href="/wiki/Market_capitalization" title="Market capitalization">large-cap</a>,
 <a href="/wiki/Dow_Jones_Industrial_Average" title="Dow Jones Industrial Average">Dow Jones Industrial Average</a>,
 <a href="#cite_note-1">[1]</a>,
 <a href="#cite_note-2">[2]</a>,
 <a href="#S&amp;P_500_component_stocks"><span class="tocnumber">1</span> <span class="toctext">S&amp;P 500 component stocks</span></a>,
 <a href="#Selected_changes_to_the_list_of_S&amp;P_500_components"><span class="tocnumber">2</span> <span class="toctext

In [41]:
# find all elements with the class "mw-jump-link"
sp.find_all(class_ = "mw-jump-link")

[<a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#searchInput">Jump to search</a>]
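
The same lookups can be tried without a network request. A sketch on a small hand-written HTML string (not the Wikipedia page), using Python's built-in `html.parser`:

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>Demo page</title></head>
<body>
  <a id="top"></a>
  <a class="jump" href="#nav">Jump to navigation</a>
  <a class="jump" href="#search">Jump to search</a>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

title_text = soup.title.text                    # text inside <title>
first_link = soup.find('a')                     # same as soup.a
jump_links = soup.find_all('a', class_='jump')  # filter by tag and class
hrefs = [a['href'] for a in jump_links]         # attributes work like dict keys
css_links = soup.select('a.jump')               # CSS selector equivalent
```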

#### Format the first table of the list of S&P 500 companies wiki page as a dataframe

[Traversing the DOM - W3C](https://www.w3.org/wiki/Traversing_the_DOM)

In [42]:
# find where the data you want resides (a tag, class name, etc)
sp_table = sp.find_all('table')
# table[0]
sp_table = sp_table[0]

In [43]:
sp_trs = sp_table.find_all('tr')
sp_trs

[<tr>
 <th><a href="/wiki/Ticker_symbol" title="Ticker symbol">Symbol</a>
 </th>
 <th>Security</th>
 <th><a href="/wiki/SEC_filing" title="SEC filing">SEC filings</a></th>
 <th><a href="/wiki/Global_Industry_Classification_Standard" title="Global Industry Classification Standard">GICS</a> Sector</th>
 <th>GICS Sub-Industry</th>
 <th>Headquarters Location</th>
 <th>Date first added</th>
 <th><a href="/wiki/Central_Index_Key" title="Central Index Key">CIK</a></th>
 <th>Founded
 </th></tr>,
 <tr>
 <td><a class="external text" href="https://www.nyse.com/quote/XNYS:MMM" rel="nofollow">MMM</a>
 </td>
 <td><a href="/wiki/3M" title="3M">3M</a></td>
 <td><a class="external text" href="https://www.sec.gov/cgi-bin/browse-edgar?CIK=MMM&amp;action=getcompany" rel="nofollow">reports</a></td>
 <td>Industrials</td>
 <td>Industrial Conglomerates</td>
 <td><a href="/wiki/Saint_Paul,_Minnesota" title="Saint Paul, Minnesota">Saint Paul, Minnesota</a></td>
 <td>1976-08-09</td>
 <td>0000066740</td>
 <td>190

In [44]:
sp_ths = sp_trs[0].find_all('th')
sp_ths
# each tr = one table row (the first is the header, the rest are one stock each); th = table header cell; td = table data cell

[<th><a href="/wiki/Ticker_symbol" title="Ticker symbol">Symbol</a>
 </th>,
 <th>Security</th>,
 <th><a href="/wiki/SEC_filing" title="SEC filing">SEC filings</a></th>,
 <th><a href="/wiki/Global_Industry_Classification_Standard" title="Global Industry Classification Standard">GICS</a> Sector</th>,
 <th>GICS Sub-Industry</th>,
 <th>Headquarters Location</th>,
 <th>Date first added</th>,
 <th><a href="/wiki/Central_Index_Key" title="Central Index Key">CIK</a></th>,
 <th>Founded
 </th>]

In [45]:
# pull the table header from the first table row
sp_th = sp_trs[0].find_all('th')
sp_header = []
for th in sp_th:
    sp_header.append(th.text)

sp_header

['Symbol\n',
 'Security',
 'SEC filings',
 'GICS Sector',
 'GICS Sub-Industry',
 'Headquarters Location',
 'Date first added',
 'CIK',
 'Founded\n']

In [46]:
# for each table row, go to the table data => extract the text => append to new list
sp_list = [] # initialize new list
for tr in sp_trs[1:]: # for each table row
    tds = tr.find_all('td') # grab the table data
tds

[<td><a class="external text" href="https://www.nyse.com/quote/XNYS:ZTS" rel="nofollow">ZTS</a>
 </td>,
 <td><a href="/wiki/Zoetis" title="Zoetis">Zoetis</a></td>,
 <td><a class="external text" href="https://www.sec.gov/cgi-bin/browse-edgar?CIK=ZTS&amp;action=getcompany" rel="nofollow">reports</a></td>,
 <td>Health Care</td>,
 <td>Pharmaceuticals</td>,
 <td><a class="mw-redirect" href="/wiki/Parsippany,_New_Jersey" title="Parsippany, New Jersey">Parsippany, New Jersey</a></td>,
 <td>2013-06-21</td>,
 <td>0001555280</td>,
 <td>1952
 </td>]

In [47]:
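# tds still holds the cells of the last row (ZTS) from the previous cell;
# extract the text from each of its cells and append the result to sp_list
tr_list = [] # initialize a table data list
for td in tds:
    tr_list.append(td.text)
sp_list.append(tr_list)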

In [48]:
pd.DataFrame(sp_list, columns=sp_header)

Unnamed: 0,Symbol\n,Security,SEC filings,GICS Sector,GICS Sub-Industry,Headquarters Location,Date first added,CIK,Founded\n
0,ZTS\n,Zoetis,reports,Health Care,Pharmaceuticals,"Parsippany, New Jersey",2013-06-21,1555280,1952\n


In [49]:
# SEC filings column contains links
# re-run for loop to get a href in the 'SEC filings'
sp_list = []
for tr in sp_trs[1:]: 
    tds = tr.find_all('td') 
    tr_list = [] 
    for (i, td) in enumerate(tds):
        if i == 2:   # if it's the third column, get the href link instead of the text
            tr_list.append(td.find('a')['href'])
        else:
            tr_list.append(td.text)
    sp_list.append(tr_list)
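
`td.text` keeps whitespace from the HTML, which is where the trailing `\n` in some values comes from. `get_text(strip=True)` trims it during extraction. A sketch on a small hand-written table (not the live Wikipedia page; the link is a placeholder):

```python
from bs4 import BeautifulSoup
import pandas as pd

html = """
<table>
<tr><th>Symbol
</th><th>Security</th><th>Founded
</th></tr>
<tr><td><a href="https://example.com/MMM">MMM</a>
</td><td>3M</td><td>1902
</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
rows = soup.find('table').find_all('tr')

# Header cells are <th>; data cells are <td>.
header = [th.get_text(strip=True) for th in rows[0].find_all('th')]
data = [[td.get_text(strip=True) for td in tr.find_all('td')]
        for tr in rows[1:]]

df_demo = pd.DataFrame(data, columns=header)
print(df_demo)
```

One caveat: with `strip=True`, text from nested tags is joined without spaces, so a header like "GICS Sector" that is split across an `<a>` tag may need a separator, for example `get_text(' ', strip=True)`.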

In [50]:
df_sp500 = pd.DataFrame(sp_list, columns=sp_header)

In [51]:
df_sp500.to_csv('formatted_data/sp500_30-June-2021.csv', index=False)

### We can do more cleaning here

In [52]:
df_sp500

Unnamed: 0,Symbol\n,Security,SEC filings,GICS Sector,GICS Sub-Industry,Headquarters Location,Date first added,CIK,Founded\n
0,MMM\n,3M,https://www.sec.gov/cgi-bin/browse-edgar?CIK=M...,Industrials,Industrial Conglomerates,"Saint Paul, Minnesota",1976-08-09,0000066740,1902\n
1,ABT\n,Abbott Laboratories,https://www.sec.gov/cgi-bin/browse-edgar?CIK=A...,Health Care,Health Care Equipment,"North Chicago, Illinois",1964-03-31,0000001800,1888\n
2,ABBV\n,AbbVie,https://www.sec.gov/cgi-bin/browse-edgar?CIK=A...,Health Care,Pharmaceuticals,"North Chicago, Illinois",2012-12-31,0001551152,2013 (1888)\n
3,ABMD\n,Abiomed,https://www.sec.gov/cgi-bin/browse-edgar?CIK=A...,Health Care,Health Care Equipment,"Danvers, Massachusetts",2018-05-31,0000815094,1981\n
4,ACN\n,Accenture,https://www.sec.gov/cgi-bin/browse-edgar?CIK=A...,Information Technology,IT Consulting & Other Services,"Dublin, Ireland",2011-07-06,0001467373,1989\n
...,...,...,...,...,...,...,...,...,...
500,YUM\n,Yum! Brands,https://www.sec.gov/cgi-bin/browse-edgar?CIK=Y...,Consumer Discretionary,Restaurants,"Louisville, Kentucky",1997-10-06,0001041061\n,1997\n
501,ZBRA\n,Zebra Technologies,https://www.sec.gov/cgi-bin/browse-edgar?CIK=Z...,Information Technology,Electronic Equipment & Instruments,"Lincolnshire, Illinois",2019-12-23,0000877212\n,1969\n
502,ZBH\n,Zimmer Biomet,https://www.sec.gov/cgi-bin/browse-edgar?CIK=Z...,Health Care,Health Care Equipment,"Warsaw, Indiana",2001-08-07,0001136869\n,1927\n
503,ZION\n,Zions Bancorp,https://www.sec.gov/cgi-bin/browse-edgar?CIK=Z...,Financials,Regional Banks,"Salt Lake City, Utah",2001-06-22,0000109380\n,1873\n


In [53]:
df_sp500.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 505 entries, 0 to 504
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Symbol
                505 non-null    object
 1   Security               505 non-null    object
 2   SEC filings            505 non-null    object
 3   GICS Sector            505 non-null    object
 4   GICS Sub-Industry      505 non-null    object
 5   Headquarters Location  505 non-null    object
 6   Date first added       505 non-null    object
 7   CIK                    505 non-null    object
 8   Founded
               505 non-null    object
dtypes: object(9)
memory usage: 35.6+ KB


In [54]:
#rename Symbol and Founded columns to remove \n
df_sp500 = df_sp500.rename(columns = {
    'Symbol\n' : 'Symbol', 'Founded\n' : 'Founded'})
df_sp500.head(2)

Unnamed: 0,Symbol,Security,SEC filings,GICS Sector,GICS Sub-Industry,Headquarters Location,Date first added,CIK,Founded
0,MMM\n,3M,https://www.sec.gov/cgi-bin/browse-edgar?CIK=M...,Industrials,Industrial Conglomerates,"Saint Paul, Minnesota",1976-08-09,66740,1902\n
1,ABT\n,Abbott Laboratories,https://www.sec.gov/cgi-bin/browse-edgar?CIK=A...,Health Care,Health Care Equipment,"North Chicago, Illinois",1964-03-31,1800,1888\n


In [55]:
#remove \n from the values in the Symbol and Founded columns

In [56]:
df_sp500['Symbol'] = df_sp500['Symbol'].str.strip()

In [57]:
df_sp500['Founded'] = df_sp500['Founded'].str.strip()

In [58]:
df_sp500.head(2)

Unnamed: 0,Symbol,Security,SEC filings,GICS Sector,GICS Sub-Industry,Headquarters Location,Date first added,CIK,Founded
0,MMM,3M,https://www.sec.gov/cgi-bin/browse-edgar?CIK=M...,Industrials,Industrial Conglomerates,"Saint Paul, Minnesota",1976-08-09,66740,1902
1,ABT,Abbott Laboratories,https://www.sec.gov/cgi-bin/browse-edgar?CIK=A...,Health Care,Health Care Equipment,"North Chicago, Illinois",1964-03-31,1800,1888
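
Further cleaning might include parsing dates and pulling a four-digit year out of values like `2013 (1888)`. A sketch on three rows based on the S&P 500 table above, with stray whitespace already stripped; which year to keep is a judgment call, and this version keeps the first one:

```python
import pandas as pd

toy = pd.DataFrame({
    'Symbol': ['MMM', 'ABT', 'ABBV'],
    'Date first added': ['1976-08-09', '1964-03-31', '2012-12-31'],
    'Founded': ['1902', '1888', '2013 (1888)'],
})

# Convert date strings to datetime64 so they can be sorted and compared.
toy['Date first added'] = pd.to_datetime(toy['Date first added'])

# Keep the first four-digit number in Founded and store it as an integer.
toy['Founded'] = toy['Founded'].str.extract(r'(\d{4})', expand=False).astype(int)
print(toy.dtypes)
```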
