# Final Project - Milestone 2: Exploratory Data Analysis
## Shreya Kamath
### Project Source: https://www.iii.org/fact-statistic/facts-statistics-sports-injuries
#### Purpose: In this notebook, I use BeautifulSoup to scrape tables of sports-injury data from my source webpage, then Pandas to convert the scraped data into DataFrames that I can perform an exploratory data analysis (EDA) on
#### Note: I mostly used this tutorial (https://www.youtube.com/watch?v=8dTpNajxaH0) as a guide to scrape all of my tables

# Part 1: Importing Necessary Packages

In [110]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

# Part 2: Creating + Testing Soup Object and Soup Object's Source URL's Status

In [111]:
# Creating a Soup Object to parse from the source link
url = 'https://www.iii.org/fact-statistic/facts-statistics-sports-injuries'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html')

In [121]:
# Checking that the request returned an 'OK' (200) status, meaning the page was fetched successfully and can be scraped
page  # Reusing the response fetched above instead of requesting the page a second time

<Response [200]>

In [None]:
soup

# Part 3: Creating Table 1

In [5]:
# Using indexing to find the specific table on the webpage I want to scrape
table1 = soup.find_all('table')[0]
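
Picking a table by position works as long as the page layout does not change. As a possible alternative, the sketch below looks a table up by one of its header texts. The helper name `find_table_with_header` and the sample HTML are assumptions for illustration; the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Illustrative HTML only; the real page's markup may differ
SAMPLE_HTML = """
<table><tr><th>Year</th><th>Total</th></tr><tr><td>2013</td><td>32893</td></tr></table>
<table><tr><th>State</th><th>Fatalities</th></tr><tr><td>Alabama</td><td>14</td></tr></table>
"""


def find_table_with_header(soup, header_text):
    """Return the first <table> that has a <th> whose text equals header_text."""
    for table in soup.find_all("table"):
        headers = [th.get_text(strip=True) for th in table.find_all("th")]
        if header_text in headers:
            return table
    return None  # no table on the page has that header


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
state_table = find_table_with_header(sample_soup, "State")
print(state_table.find("td").get_text(strip=True))  # Alabama
```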

In [None]:
table1

In [9]:
titles1 = table1.find_all('th') # Finding all header (<th>) tags to get the column titles for my table
table1_titles = [title.text.strip() for title in titles1] # Extracting each tag's text and stripping surrounding whitespace
table1_titles = table1_titles[3:] # Slicing off the first three entries, which are empty header cells caused by the table's structure on the website

In [113]:
table1_titles

['Sport, activity or equipment',
 'Injuries (1)',
 'Younger than 5',
 '5 to 14',
 '15 to 24',
 '25 to 64',
 '65 and older']
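
The fixed slice above depends on knowing how many leading header cells to discard. One possible alternative, sketched below, reads the column names from the last row that contains `<th>` cells. This assumes the real table keeps its column names in that row, which has not been verified against the site; the sample markup is made up to mimic a title cell followed by empty cells.

```python
from bs4 import BeautifulSoup

# Illustrative markup only: a title cell, two empty cells, then the real headers
SAMPLE_HTML = """
<table>
  <tr><th>Sample title</th><th></th><th> </th></tr>
  <tr><th>Sport, activity or equipment</th><th>Injuries (1)</th><th>5 to 14</th></tr>
</table>
"""

sample_table = BeautifulSoup(SAMPLE_HTML, "html.parser").find("table")
# Keep only the rows that contain header cells
header_rows = [tr for tr in sample_table.find_all("tr") if tr.find("th")]
# Use the last header row, which holds the column names in this sample
sample_titles = [th.get_text(strip=True) for th in header_rows[-1].find_all("th")]
print(sample_titles)
```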

In [115]:
# Creating a dataframe with column titles that are the same as the table headers
df1 = pd.DataFrame(columns = table1_titles)

In [116]:
df1

Unnamed: 0,"Sport, activity or equipment",Injuries (1),Younger than 5,5 to 14,15 to 24,25 to 64,65 and older


In [118]:
# Getting all of the table's rows (<tr> tags)
col_data1 = table1.find_all('tr')
for row in col_data1[3:]: # Skipping the first three rows, which hold no <td> cells and would produce empty lists of row data
    row_data = row.find_all('td')
    individual_row_data = [data.text.strip() for data in row_data]

    # Appending the row at the next integer index; pandas raises an error if the row's length does not match the number of columns
    length = len(df1)
    df1.loc[length] = individual_row_data

In [119]:
df1

Unnamed: 0,"Sport, activity or equipment",Injuries (1),Younger than 5,5 to 14,15 to 24,25 to 64,65 and older
0,"Exercise, exercise equipment",482886,7750,40592,95671,247518,91354
1,Bicycles and accessories,405688,13312,92776,54207,195805,49588
2,Basketball,332391,1573,114123,149816,64915,1964
3,"ATV's, mopeds, minibikes, etc.",269657,3827,44487,76485,131303,13555
4,Football,263585,560,140877,101796,19605,746
5,"Skateboards, scooters, hoverboards",221313,6528,60376,58480,88507,7422
6,Soccer,212423,2069,101072,75978,32140,1164
7,Playground equipment,190942,49233,125692,5373,9019,1624
8,"Swimming, pools, equipment",166011,20254,66420,24348,42791,12198
9,"Baseball, softball",139940,2399,59255,42973,31789,3524
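
The row loop above adds one row at a time with `df1.loc[length]`. A possible variation, sketched below, collects the rows in a list, keeps only those whose length matches the number of columns, and builds the DataFrame once. The sample HTML is illustrative and the shortened column names are assumptions, not the notebook's real headers.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Illustrative markup: a header row, a spacer row, and two data rows
SAMPLE_HTML = """
<table>
  <tr><th>Sport</th><th>Injuries</th></tr>
  <tr><td></td></tr>
  <tr><td>Basketball</td><td>332391</td></tr>
  <tr><td>Football</td><td>263585</td></tr>
</table>
"""

columns = ["Sport", "Injuries"]  # shortened names for the example
sample_table = BeautifulSoup(SAMPLE_HTML, "html.parser").find("table")

rows = []
for tr in sample_table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(cells) == len(columns):  # skips header rows and malformed rows
        rows.append(cells)

sample_df = pd.DataFrame(rows, columns=columns)
print(sample_df)
```

Filtering on length removes the need to know in advance how many leading rows to skip, at the cost of silently dropping any row with an unexpected number of cells.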


In [163]:
df1.to_csv('sportsInjuries_uncleaned.csv', header=True, index=False)

# Creating Table 2

In [123]:
# Using indexing to find the specific table on the webpage I want to scrape
table2 = soup.find_all('table')[2]

In [None]:
table2

In [22]:
titles2 = table2.find_all('th') # Finding all header (<th>) tags to get the column titles for my table
table2_titles = [title.text.strip() for title in titles2] # Extracting each tag's text and stripping surrounding whitespace

In [125]:
table2_titles

['\ufeffYear',
 'Total',
 'Pedalcyclist',
 'Pedalcyclists as a\npercent of\ntotal fatalities']

In [126]:
# Creating a dataframe with column titles that are the same as the table headers
df2 = pd.DataFrame(columns = table2_titles)

In [127]:
df2

Unnamed: 0,﻿Year,Total,Pedalcyclist,Pedalcyclists as a\npercent of\ntotal fatalities


In [128]:
# Getting all of the table's rows (<tr> tags)
col_data2 = table2.find_all('tr')
for row in col_data2[2:]: # Skipping the first two rows, which hold no <td> cells and would produce empty lists of row data
    row_data = row.find_all('td')
    individual_row_data = [data.text.strip() for data in row_data]

    # Appending the row at the next integer index; pandas raises an error if the row's length does not match the number of columns
    length = len(df2)
    df2.loc[length] = individual_row_data

In [129]:
df2

Unnamed: 0,﻿Year,Total,Pedalcyclist,Pedalcyclists as a\npercent of\ntotal fatalities
0,2013,32893,749,2.3%
1,2014,32744,729,2.2
2,2015,35484,829,2.3
3,2016,37806,853,2.3
4,2017,37473,806,2.2
5,2018,36835,871,2.4
6,2019,36355,859,2.4
7,2020,39007,948,2.4
8,2021,43230,976,2.3
9,2022 (2),42514,1105,2.6


In [164]:
df2.to_csv('motorFatalities_uncleaned.csv', header=True, index=False)

# Creating Table 3

In [130]:
# Using indexing to find the specific table on the webpage I want to scrape
table3 = soup.find_all('table')[6]

In [None]:
table3

In [132]:
titles3 = table3.find_all('th') # Finding all header (<th>) tags to get the column titles for my table
table3_titles = [title.text.strip() for title in titles3] # Extracting each tag's text and stripping surrounding whitespace
table3_titles = table3_titles[3:] # Slicing off the first three entries, which are empty header cells caused by the table's structure on the website

In [133]:
table3_titles

['State',
 'Resident population\n(000)',
 'Total traffic\nfatalities',
 'Fatalities',
 'Percent of total\ntraffic fatalities',
 'Fatalities\xa0\xa0\n100,000 population']

In [138]:
# Creating a dataframe with column titles that are the same as the table headers
df3 = pd.DataFrame(columns = table3_titles)

In [139]:
df3

Unnamed: 0,State,Resident population\n(000),Total traffic\nfatalities,Fatalities,Percent of total\ntraffic fatalities,"Fatalities \n100,000 population"


In [140]:
# Getting all of the table's rows (<tr> tags)
col_data3 = table3.find_all('tr')
for row in col_data3[3:]: # Skipping the first three rows, which hold no <td> cells and would produce empty lists of row data
    row_data = row.find_all('td')
    individual_row_data = [data.text.strip() for data in row_data]

    # Appending the row at the next integer index; pandas raises an error if the row's length does not match the number of columns
    length = len(df3)
    df3.loc[length] = individual_row_data

In [141]:
df3

Unnamed: 0,State,Resident population\n(000),Total traffic\nfatalities,Fatalities,Percent of total\ntraffic fatalities,"Fatalities \n100,000 population"
0,Alabama,5074,988,14,1.4%,0.28
1,Alaska,734,82,2,2.40,0.27
2,Arizona,7359,1302,50,3.80,0.68
3,Arkansas,3046,643,6,0.90,0.20
4,California,39029,4428,177,4.00,0.45
5,Colorado,5840,764,15,2.00,0.26
6,Connecticut,3626,359,3,0.80,0.08
7,Delaware,1018,162,6,3.70,0.59
8,District of Columbia,672,32,3,9.40,0.45
9,Florida,22245,3530,222,6.30,1.00


In [165]:
df3.to_csv('stateFatalities_uncleaned.csv', header=True, index=False)

# Creating Table 4

In [142]:
# Using indexing to find the specific table on the webpage I want to scrape
table4 = soup.find_all('table')[8]

In [None]:
table4

In [144]:
titles4 = table4.find_all('th') # Finding all header (<th>) tags to get the column titles for my table
table4_titles = [title.text.strip() for title in titles4] # Extracting each tag's text and stripping surrounding whitespace
table4_titles = table4_titles[5:] # Slicing off the first five entries, which are empty header cells caused by the table's structure on the website

In [145]:
table4_titles

['City (2)',
 'Resident\npopulation',
 'Total traffic\nfatalities',
 'Fatalities',
 'As a percent\nof total\ntraffic fatalities',
 'Total',
 'Pedalcyclist',
 'Pedalcyclist\nrank (3)']

In [146]:
# Creating a dataframe with column titles that are the same as the table headers
df4 = pd.DataFrame(columns = table4_titles)

In [147]:
df4

Unnamed: 0,City (2),Resident\npopulation,Total traffic\nfatalities,Fatalities,As a percent\nof total\ntraffic fatalities,Total,Pedalcyclist,Pedalcyclist\nrank (3)


In [148]:
# Getting all of the table's rows (<tr> tags)
col_data4 = table4.find_all('tr')
for row in col_data4[3:]: # Skipping the first three rows, which hold no <td> cells and would produce empty lists of row data
    row_data = row.find_all('td')
    individual_row_data = [data.text.strip() for data in row_data]

    # Appending the row at the next integer index; pandas raises an error if the row's length does not match the number of columns
    length = len(df4)
    df4.loc[length] = individual_row_data

In [149]:
df4

Unnamed: 0,City (2),Resident\npopulation,Total traffic\nfatalities,Fatalities,As a percent\nof total\ntraffic fatalities,Total,Pedalcyclist,Pedalcyclist\nrank (3)
0,"New York, NY",8335897,238,20,8.4%,2.86,0.24,28
1,"Los Angeles, CA",3822238,354,20,5.6,9.26,0.52,13
2,"Chicago, IL",2665039,192,10,5.2,7.2,0.38,22
3,"Houston, TX",2302878,323,11,3.4,14.03,0.48,16
4,"Phoenix, AZ",1644409,311,19,6.1,18.91,1.16,3
5,"Philadelphia, PA",1567258,142,3,2.1,9.06,0.19,29
6,"San Antonio, TX",1472909,203,8,3.9,13.78,0.54,12
7,"San Diego, CA",1381162,118,2,1.7,8.54,0.14,33
8,"Dallas, TX",1299544,228,5,2.2,17.54,0.38,21
9,"Austin, TX",974447,119,1,0.8,12.21,0.10,35


In [166]:
df4.to_csv('cityFatalities_uncleaned.csv', header=True, index=False)

# Creating Table 5

In [None]:
# Creating a Soup Object to parse from the source link
url2 = 'https://www.iii.org/table-archive/20657' #Need a second link to access another set of tables on the same site
page2 = requests.get(url2)
soup2 = BeautifulSoup(page2.text, 'html')

In [151]:
# Checking that the request returned an 'OK' (200) status, meaning the page was fetched successfully and can be scraped
page2  # Reusing the response fetched above instead of requesting the page a second time

<Response [200]>

In [None]:
soup2

In [153]:
# Using indexing to find the specific table on the webpage I want to scrape
table5 = soup2.find_all('table')[22]

In [None]:
table5

In [155]:
titles5 = table5.find_all('th') # Finding all header (<th>) tags to get the column titles for my table
table5_titles = [title.text.strip() for title in titles5] # Extracting each tag's text and stripping surrounding whitespace
table5_titles = table5_titles[8:] # Slicing off the first eight entries, which are empty header cells caused by the table's structure on the website

In [156]:
table5_titles

['Year',
 'Total',
 'Number',
 'Percent\nof total',
 'Total',
 'Number',
 'Percent\nof total']

In [157]:
# Creating a dataframe with column titles that are the same as the table headers
df5 = pd.DataFrame(columns = table5_titles)

In [158]:
df5

Unnamed: 0,Year,Total,Number,Percent\nof total,Total.1,Number.1,Percent\nof total.1


In [159]:
# Getting all of the table's rows (<tr> tags)
col_data5 = table5.find_all('tr')
for row in col_data5[4:]: # Skipping the first four rows, which hold no <td> cells and would produce empty lists of row data
    row_data = row.find_all('td')
    individual_row_data = [data.text.strip() for data in row_data]

    # Appending the row at the next integer index; pandas raises an error if the row's length does not match the number of columns
    length = len(df5)
    df5.loc[length] = individual_row_data

In [160]:
df5

Unnamed: 0,Year,Total,Number,Percent\nof total,Total.1,Number.1,Percent\nof total.1
0,2014,588,73,12%,93700,24800,26%
1,2015,593,88,15,97200,26700,28
2,2016,591,65,11,101200,26800,26
3,2017,463,67,14,93800,24800,26
4,2018,264,27,10,81800,21700,26


In [167]:
df5.to_csv('atvIncidents_uncleaned.csv', header=True, index=False)

# Part 4: Dataframe Properties and Exploratory Data Analysis
## df1

In [89]:
df1.describe()

Unnamed: 0,"Sport, activity or equipment",Injuries (1),Younger than 5,5 to 14,15 to 24,25 to 64,65 and older
count,22,22,22,22,22,22,22
unique,22,22,22,22,22,21,22
top,"Exercise, exercise equipment",482886,7750,40592,95671,8645,91354
freq,1,1,1,1,1,2,1


### Notes: Everything looks good. One value (8645) appears twice in the 25 to 64 column; since every other column has 22 unique values, this is not a duplicated row, but it is worth checking during data cleaning
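
One way to inspect a repeated value like this is to list the rows that share it. The sketch below uses made-up rows (the activity labels and most values are placeholders, not scraped data) to show the idea with `Series.duplicated(keep=False)`.

```python
import pandas as pd

# Made-up rows for illustration; not taken from the scraped table
toy = pd.DataFrame(
    {
        "activity": ["A", "B", "C", "D"],
        "25 to 64": ["8645", "120", "8645", "77"],
    }
)

# keep=False marks every occurrence of a repeated value, not just the later ones
repeats = toy[toy["25 to 64"].duplicated(keep=False)]
print(repeats)

# Whole-row duplicates are a separate check
print(toy.duplicated().any())
```

If the matching rows belong to different activities, the repeated value is probably a coincidence rather than a scraping error.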

In [91]:
df1.shape

(22, 7)

In [92]:
df1.dtypes

Sport, activity or equipment    object
Injuries (1)                    object
Younger than 5                  object
5 to 14                         object
15 to 24                        object
25 to 64                        object
65 and older                    object
dtype: object

### Notes: Every column has dtype object because the scraped cell text is stored as strings; the numeric columns will need to be cast to a numeric type before performing any calculations
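
A minimal sketch of that cast, using the first three rows and two of the numeric columns from the df1 output shown earlier. Stripping commas is a defensive assumption in case a cell contains thousands separators; the values shown in this notebook do not.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "Sport, activity or equipment": [
            "Exercise, exercise equipment",
            "Bicycles and accessories",
            "Basketball",
        ],
        "Injuries (1)": ["482886", "405688", "332391"],
        "Younger than 5": ["7750", "13312", "1573"],
    }
)

numeric_cols = ["Injuries (1)", "Younger than 5"]
converted = sample.copy()
for col in numeric_cols:
    # errors="raise" stops on any cell that is not a valid number
    converted[col] = pd.to_numeric(
        converted[col].str.replace(",", "", regex=False), errors="raise"
    )

print(converted.dtypes)
print(converted["Injuries (1)"].sum())
```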

In [93]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 22 entries, 0 to 21
Data columns (total 7 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   Sport, activity or equipment  22 non-null     object
 1   Injuries (1)                  22 non-null     object
 2   Younger than 5                22 non-null     object
 3   5 to 14                       22 non-null     object
 4   15 to 24                      22 non-null     object
 5   25 to 64                      22 non-null     object
 6   65 and older                  22 non-null     object
dtypes: object(7)
memory usage: 1.4+ KB


### Notes: No missing info, which is good

## df2

In [94]:
df2.describe()

Unnamed: 0,﻿Year,Total,Pedalcyclist,Pedalcyclists as a\npercent of\ntotal fatalities
count,10,10,10,10.0
unique,10,10,10,5.0
top,2013,32893,749,2.3
freq,1,1,1,3.0


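### Notes: The fourth column has only 5 unique values across 10 rows, which is plausible for rounded percentages, but '2.3%' in the first row is counted separately from '2.3', so the '%' sign will need to be removed. The column names also need cleaning: Year starts with a '\ufeff' byte-order mark and the last name contains newlines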
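
A possible way to tidy these column names is sketched below. The helper name `clean_name` is an assumption; it removes the byte-order mark and collapses any run of whitespace (including newlines and non-breaking spaces) into a single space.

```python
import pandas as pd

raw_columns = [
    "\ufeffYear",
    "Total",
    "Pedalcyclist",
    "Pedalcyclists as a\npercent of\ntotal fatalities",
]
sample = pd.DataFrame(columns=raw_columns)


def clean_name(name: str) -> str:
    # Drop the byte-order mark, then split on whitespace and rejoin with single spaces
    return " ".join(name.replace("\ufeff", "").split())


sample.columns = [clean_name(c) for c in sample.columns]
print(list(sample.columns))
```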

In [95]:
df2.shape

(10, 4)

In [96]:
df2.dtypes

﻿Year                                               object
Total                                               object
Pedalcyclist                                        object
Pedalcyclists as a\npercent of\ntotal fatalities    object
dtype: object

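### Note: All numerical data will need to be cast to a numeric type before making calculations; the '%' sign in the first row and the footnote marker in '2022 (2)' will have to be removed first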
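
The sketch below shows one way to strip those markers before casting, using three rows from the df2 output above. The short column names `Year` and `Percent` are simplifications for the example.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "Year": ["2013", "2021", "2022 (2)"],
        "Percent": ["2.3%", "2.3", "2.6"],
    }
)

cleaned = pd.DataFrame(
    {
        # Remove a trailing footnote marker such as " (2)" before casting to int
        "Year": sample["Year"].str.replace(r"\s*\(\d+\)$", "", regex=True).astype(int),
        # Remove a trailing percent sign before casting to float
        "Percent": sample["Percent"].str.rstrip("%").astype(float),
    }
)
print(cleaned)
```

After cleaning, '2.3%' and '2.3' become the same number, so the count of unique values in the percent column drops accordingly.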

In [97]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column                                          Non-Null Count  Dtype 
---  ------                                          --------------  ----- 
 0   ﻿Year                                           10 non-null     object
 1   Total                                           10 non-null     object
 2   Pedalcyclist                                    10 non-null     object
 3   Pedalcyclists as a
percent of
total fatalities  10 non-null     object
dtypes: object(4)
memory usage: 400.0+ bytes


### Notes: No empty cells

## df3

In [98]:
df3.describe()

Unnamed: 0,State,Resident population\n(000),Total traffic\nfatalities,Fatalities,Percent of total\ntraffic fatalities,"Fatalities \n100,000 population"
count,52,52,52,52,52.0,52.0
unique,52,52,52,28,35.0,36.0
top,Alabama,5074,988,15,2.1,0.13
freq,1,1,1,6,4.0,3.0


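### Notes: The repeated values in the fatality and percentage columns should be reviewed, although different states sharing the same value is plausible; the column names contain newlines and non-breaking spaces and will have to be changed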

In [99]:
df3.shape

(52, 6)

In [100]:
df3.dtypes

State                                   object
Resident population\n(000)              object
Total traffic\nfatalities               object
Fatalities                              object
Percent of total\ntraffic fatalities    object
Fatalities  \n100,000 population        object
dtype: object

### Notes: Numerical objects will need to be typecast in order to perform calculations with them
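
Once the columns are numeric, one useful sanity check is to recompute the percentage column from the two fatality counts and compare it with the reported value. The sketch below uses the first three rows of the df3 output above, with the newlines removed from the column names for readability. The 0.05 tolerance is an assumption based on the percentages being shown to one decimal place.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "State": ["Alabama", "Alaska", "Arizona"],
        "Total traffic fatalities": ["988", "82", "1302"],
        "Fatalities": ["14", "2", "50"],
        "Percent of total traffic fatalities": ["1.4%", "2.40", "3.80"],
    }
)

total = pd.to_numeric(sample["Total traffic fatalities"])
fatalities = pd.to_numeric(sample["Fatalities"])
reported = pd.to_numeric(sample["Percent of total traffic fatalities"].str.rstrip("%"))

# Recompute the share and compare it with the scraped percentage
recomputed = (fatalities / total * 100).round(1)
sample["percent_matches"] = (recomputed - reported).abs() < 0.05
print(sample[["State", "percent_matches"]])
```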

In [101]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
Index: 52 entries, 0 to 51
Data columns (total 6 columns):
 #   Column                               Non-Null Count  Dtype 
---  ------                               --------------  ----- 
 0   State                                52 non-null     object
 1   Resident population
(000)            52 non-null     object
 2   Total traffic
fatalities             52 non-null     object
 3   Fatalities                           52 non-null     object
 4   Percent of total
traffic fatalities  52 non-null     object
 5   Fatalities  
100,000 population      52 non-null     object
dtypes: object(6)
memory usage: 2.8+ KB


### Notes: No missing/empty cells

## df4

In [102]:
df4.describe()

Unnamed: 0,City (2),Resident\npopulation,Total traffic\nfatalities,Fatalities,As a percent\nof total\ntraffic fatalities,Total,Pedalcyclist,Pedalcyclist\nrank (3)
count,37,37,37,37,37.0,37.0,37.0,37
unique,37,37,33,14,31.0,37.0,32.0,36
top,"New York, NY",8335897,228,2,2.2,2.86,0.52,36
freq,1,1,2,6,3.0,1.0,2.0,2


### Notes: Multiple recurring values may have to be cleaned, column names will need to be changed, and some columns will need to be dropped because they are unnecessary for my study
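
Dropping and renaming columns could look like the sketch below, which uses the first two rows of the df4 output above. Which columns to drop and what to call the rest are hypothetical choices made only for this example; the notebook does not say which columns are unnecessary.

```python
import pandas as pd

sample = pd.DataFrame(
    [
        ["New York, NY", "8335897", "238", "20", "8.4%", "2.86", "0.24", "28"],
        ["Los Angeles, CA", "3822238", "354", "20", "5.6", "9.26", "0.52", "13"],
    ],
    columns=[
        "City (2)",
        "Resident\npopulation",
        "Total traffic\nfatalities",
        "Fatalities",
        "As a percent\nof total\ntraffic fatalities",
        "Total",
        "Pedalcyclist",
        "Pedalcyclist\nrank (3)",
    ],
)

# Hypothetical selection: drop the rank column and give three columns shorter names
trimmed = sample.drop(columns=["Pedalcyclist\nrank (3)"]).rename(
    columns={
        "City (2)": "city",
        "Resident\npopulation": "population",
        "Total traffic\nfatalities": "total_traffic_fatalities",
    }
)
print(list(trimmed.columns))
```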

In [103]:
df4.shape

(37, 8)

In [104]:
df4.dtypes

City (2)                                      object
Resident\npopulation                          object
Total traffic\nfatalities                     object
Fatalities                                    object
As a percent\nof total\ntraffic fatalities    object
Total                                         object
Pedalcyclist                                  object
Pedalcyclist\nrank (3)                        object
dtype: object

### Note: Numerical objects will need to be cast to int/float before performing mathematical calculations on them

In [105]:
df4.info()

<class 'pandas.core.frame.DataFrame'>
Index: 37 entries, 0 to 36
Data columns (total 8 columns):
 #   Column                                    Non-Null Count  Dtype 
---  ------                                    --------------  ----- 
 0   City (2)                                  37 non-null     object
 1   Resident
population                       37 non-null     object
 2   Total traffic
fatalities                  37 non-null     object
 3   Fatalities                                37 non-null     object
 4   As a percent
of total
traffic fatalities  37 non-null     object
 5   Total                                     37 non-null     object
 6   Pedalcyclist                              37 non-null     object
 7   Pedalcyclist
rank (3)                     37 non-null     object
dtypes: object(8)
memory usage: 3.6+ KB


### Note: No missing or N/A values

## df5

In [106]:
df5.describe()

Unnamed: 0,Year,Total,Number,Percent\nof total,Total.1,Number.1,Percent\nof total.1
count,5,5,5,5,5,5,5
unique,5,5,5,5,5,4,3
top,2014,588,73,12%,93700,24800,26
freq,1,1,1,1,1,2,3


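### Note: Multiple repeated values may need to be reviewed; the column names Total, Number, and Percent\nof total each appear twice and will need to be made unique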
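
Duplicate column names make selections such as `df5['Total']` return two columns at once. One possible fix is sketched below with a hypothetical helper, `make_unique`, that appends a counter to repeated names. More descriptive names would be better, but the group headers they belong to are not shown in this notebook.

```python
import pandas as pd

raw_columns = [
    "Year", "Total", "Number", "Percent\nof total",
    "Total", "Number", "Percent\nof total",
]


def make_unique(names):
    """Append _2, _3, ... to repeated names so that every label is unique."""
    seen = {}
    unique = []
    for name in names:
        seen[name] = seen.get(name, 0) + 1
        unique.append(name if seen[name] == 1 else f"{name}_{seen[name]}")
    return unique


# First row of the df5 output shown above
sample = pd.DataFrame(
    [["2014", "588", "73", "12%", "93700", "24800", "26%"]],
    columns=raw_columns,
)
sample.columns = make_unique(sample.columns)
print(list(sample.columns))
```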

In [107]:
df5.shape

(5, 7)

In [108]:
df5.dtypes

Year                 object
Total                object
Number               object
Percent\nof total    object
Total                object
Number               object
Percent\nof total    object
dtype: object

### Note: Numerical objects will need to be cast to a numeric type for mathematical operations

In [109]:
df5.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, 0 to 4
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Year              5 non-null      object
 1   Total             5 non-null      object
 2   Number            5 non-null      object
 3   Percent
of total  5 non-null      object
 4   Total             5 non-null      object
 5   Number            5 non-null      object
 6   Percent
of total  5 non-null      object
dtypes: object(7)
memory usage: 320.0+ bytes


### Note: No N/A or Missing values that need to be cleaned