<h1><center>Access Data From A DataFrame</center></h1>

## Prep: Import pandas package and read the dataset into a DataFrame.

The `read_csv()` function can also read CSV files hosted on the Internet. Just pass the file's URL instead of a local path.

For example, the same file is hosted on GitHub: https://raw.githubusercontent.com/BlueJayADAL/CS121/main/datasets/nfl_height_weight.csv

In [4]:
import pandas as pd

In [5]:
file_url = 'https://raw.githubusercontent.com/BlueJayADAL/CS121/main/datasets/nfl_height_weight.csv'
df = pd.read_csv(file_url)



In [6]:
# Show the top 5 rows from the dataframe
df.head()

Unnamed: 0,number,full_name,position,height_in_inches,weight_in_lbs,date_of_birth,team
0,23,"Alford, Robert",CB,70,186,11/1/1988,ATL
1,95,"Babineaux, Jonathan",DT,74,300,10/12/1981,ATL
2,72,"Baker, Sam",T,77,301,5/30/1985,ATL
3,59,"Bartu, Joplo",OLB,74,230,10/3/1990,ATL
4,71,"Biermann, Kroy",OLB,75,255,9/12/1985,ATL


## Section 1: Access the columns or rows from a DataFrame

#### Access a single column

Subscript a DataFrame by a column name to get the contents of that column. It's very similar to accessing values from a dictionary.

In [7]:
# Select only the position column

df['position']


0        CB
1        DT
2         T
3       OLB
4       OLB
       ... 
1869      G
1870     OT
1871    OLB
1872     RB
1873     QB
Name: position, Length: 1874, dtype: object
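
As a minimal, self-contained sketch of the same idea, the cell below builds a three-row table by hand (the name `players` is made up for illustration; the values are copied from rows shown earlier) and checks that subscripting by a column name gives back a pandas `Series`:

```python
import pandas as pd

# Hypothetical miniature table, used only to illustrate column access
players = pd.DataFrame({
    'full_name': ['Alford, Robert', 'Baker, Sam', 'Cone, Kevin'],
    'position': ['CB', 'T', 'WR'],
    'height_in_inches': [70, 77, 74],
})

# Subscripting with a single column name returns a Series
positions = players['position']

print(type(positions).__name__)
print(positions.tolist())
```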

#### Access a single row (entry) with positional index

Subscript a DataFrame by a row index with the `.iloc` attribute to get the contents of that row.

In [8]:

df.iloc[0]


number                          23
full_name           Alford, Robert
position                        CB
height_in_inches                70
weight_in_lbs                  186
date_of_birth            11/1/1988
team                           ATL
Name: 0, dtype: object

#### Access a single row with a descriptive row name

Run the cell below to create a new DataFrame, `df2`, that uses each player's full name as the row index.

In [9]:
df2 = df.set_index('full_name')


df2.head()

Unnamed: 0_level_0,number,position,height_in_inches,weight_in_lbs,date_of_birth,team
full_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Alford, Robert",23,CB,70,186,11/1/1988,ATL
"Babineaux, Jonathan",95,DT,74,300,10/12/1981,ATL
"Baker, Sam",72,T,77,301,5/30/1985,ATL
"Bartu, Joplo",59,OLB,74,230,10/3/1990,ATL
"Biermann, Kroy",71,OLB,75,255,9/12/1985,ATL


Now we can subscript the DataFrame `df2` by a row name with the `.loc` attribute to get the contents of that row:
`df2.loc['Alford, Robert']`


In [10]:

df2.loc['Alford, Robert']


number                     23
position                   CB
height_in_inches           70
weight_in_lbs             186
date_of_birth       11/1/1988
team                      ATL
Name: Alford, Robert, dtype: object

It's worth mentioning that positional row indices still work on `df2`: `.iloc` always selects by integer position, whatever the index labels are.

In [11]:

df2.iloc[0]


number                     23
position                   CB
height_in_inches           70
weight_in_lbs             186
date_of_birth       11/1/1988
team                      ATL
Name: Alford, Robert, dtype: object
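
The relationship between `.loc` and `.iloc` can be sketched on a small hand-made table (the names `players` and `by_name` are illustrative assumptions, not part of the notebook). The same row is fetched once by label and once by position:

```python
import pandas as pd

# Hypothetical miniature table for illustration
players = pd.DataFrame({
    'full_name': ['Alford, Robert', 'Baker, Sam', 'Cone, Kevin'],
    'position': ['CB', 'T', 'WR'],
    'weight_in_lbs': [186, 301, 216],
})

# set_index() returns a new DataFrame; players itself keeps its 0..2 index
by_name = players.set_index('full_name')

row_by_label = by_name.loc['Baker, Sam']   # label-based lookup
row_by_position = by_name.iloc[1]          # position-based lookup (second row)

print(row_by_label.equals(row_by_position))
```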

#### Access multiple discrete columns

What if we want to access multiple columns? No problem: provide a `list` of column names in the subscript.

In [12]:

df[['height_in_inches', 'weight_in_lbs']]


Unnamed: 0,height_in_inches,weight_in_lbs
0,70,186
1,74,300
2,77,301
3,74,230
4,75,255
...,...,...
1869,75,303
1870,78,314
1871,73,237
1872,71,215
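
One detail worth seeing in isolation: a single name in the subscript gives a `Series`, while a list (even a list of one name) gives a `DataFrame`. The sketch below uses a hypothetical three-row `players` table to show this, and that the returned columns follow the order of the list:

```python
import pandas as pd

# Hypothetical miniature table for illustration
players = pd.DataFrame({
    'position': ['CB', 'T', 'WR'],
    'height_in_inches': [70, 77, 74],
    'weight_in_lbs': [186, 301, 216],
})

one_col_series = players['height_in_inches']      # single name -> Series
one_col_frame = players[['height_in_inches']]     # list of one name -> DataFrame
two_cols = players[['weight_in_lbs', 'height_in_inches']]  # order follows the list

print(type(one_col_series).__name__, type(one_col_frame).__name__)
print(list(two_cols.columns))
```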


#### Access multiple discrete rows

What if we want to access multiple rows? Likewise, provide a `list` of row positions to `iloc[]` or a `list` of row labels to `loc[]`.

In [13]:

df.iloc[[0, 2, 10]]


Unnamed: 0,number,full_name,position,height_in_inches,weight_in_lbs,date_of_birth,team
0,23,"Alford, Robert",CB,70,186,11/1/1988,ATL
2,72,"Baker, Sam",T,77,301,5/30/1985,ATL
10,15,"Cone, Kevin",WR,74,216,3/20/1988,ATL


#### Exercise: Grab the same rows out as above from `df2` but by using `loc[]`

In [14]:

df2.loc[['Alford, Robert', 'Baker, Sam', 'Cone, Kevin']]


Unnamed: 0_level_0,number,position,height_in_inches,weight_in_lbs,date_of_birth,team
full_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Alford, Robert",23,CB,70,186,11/1/1988,ATL
"Baker, Sam",72,T,77,301,5/30/1985,ATL
"Cone, Kevin",15,WR,74,216,3/20/1988,ATL


## Section 2: Slice a contiguous range of data in DataFrames

#### Slice DataFrame `df` with row indices range from 10 to 19. 

*Note: the range rule still applies for `iloc[]`: inclusive for the beginning index and exclusive for the ending index*

In [15]:

df.iloc[10:20]


Unnamed: 0,number,full_name,position,height_in_inches,weight_in_lbs,date_of_birth,team
10,15,"Cone, Kevin",WR,74,216,3/20/1988,ATL
11,4,"Davis, Dominique",QB,75,210,7/17/1989,ATL
12,19,"Davis, Drew",WR,73,205,1/4/1989,ATL
13,28,"DeCoud, Thomas",FS,74,192,3/19/1985,ATL
14,52,"Dent, Akeem",MLB,73,239,9/27/1987,ATL
15,42,"DiMarco, Patrick",FB,73,234,4/30/1989,ATL
16,83,"Douglas, Harry",WR,72,183,9/16/1984,ATL
17,34,"Ewing, Bradie",FB,71,243,12/26/1989,ATL
18,24,"Franks, Dominique",DB,72,197,10/8/1987,ATL
19,53,"Gaither, Omar",MLB,74,235,3/18/1984,ATL


#### Slice DataFrame `df2` with player's full name from `Cone, Kevin` to `Gaither, Omar`

*Note: the range rule does NOT apply to `loc[]`: because you are slicing by label, both the beginning label and the ending label are included*

In [16]:

df2.loc['Cone, Kevin':'Gaither, Omar']


Unnamed: 0_level_0,number,position,height_in_inches,weight_in_lbs,date_of_birth,team
full_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Cone, Kevin",15,WR,74,216,3/20/1988,ATL
"Davis, Dominique",4,QB,75,210,7/17/1989,ATL
"Davis, Drew",19,WR,73,205,1/4/1989,ATL
"DeCoud, Thomas",28,FS,74,192,3/19/1985,ATL
"Dent, Akeem",52,MLB,73,239,9/27/1987,ATL
"DiMarco, Patrick",42,FB,73,234,4/30/1989,ATL
"Douglas, Harry",83,WR,72,183,9/16/1984,ATL
"Ewing, Bradie",34,FB,71,243,12/26/1989,ATL
"Franks, Dominique",24,DB,72,197,10/8/1987,ATL
"Gaither, Omar",53,MLB,74,235,3/18/1984,ATL

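The difference between the two slicing rules can be sketched on a four-row table (the `by_name` name is an assumption for illustration; the values are taken from the rows above). The positional slice leaves out its end position, while the label slice keeps its end label:

```python
import pandas as pd

# Hypothetical miniature table indexed by player name
by_name = pd.DataFrame(
    {'number': [15, 4, 19, 28]},
    index=['Cone, Kevin', 'Davis, Dominique', 'Davis, Drew', 'DeCoud, Thomas'],
)

# iloc: positions 0 and 1 only; position 2 is excluded
positional = by_name.iloc[0:2]

# loc: every row from the first label through the last label, inclusive
labelled = by_name.loc['Cone, Kevin':'Davis, Drew']

print(len(positional), len(labelled))
```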

#### You can even use `loc[]` and `iloc[]` to select a subset of columns from the sliced data. Just use a comma to separate the row part and the column part.

In [17]:
# When using iloc[], both the row part and the column part must be integer positions

df2.iloc[10:19, 2:5]



Unnamed: 0_level_0,height_in_inches,weight_in_lbs,date_of_birth
full_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"Cone, Kevin",74,216,3/20/1988
"Davis, Dominique",75,210,7/17/1989
"Davis, Drew",73,205,1/4/1989
"DeCoud, Thomas",74,192,3/19/1985
"Dent, Akeem",73,239,9/27/1987
"DiMarco, Patrick",73,234,4/30/1989
"Douglas, Harry",72,183,9/16/1984
"Ewing, Bradie",71,243,12/26/1989
"Franks, Dominique",72,197,10/8/1987


#### Exercise: Slice a similar block from `df2` using `loc[]` (rows `Cone, Kevin` to `Gaither, Omar`, columns `position` to `weight_in_lbs`)

In [18]:

df2.loc['Cone, Kevin':'Gaither, Omar', 'position':'weight_in_lbs']


Unnamed: 0_level_0,position,height_in_inches,weight_in_lbs
full_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"Cone, Kevin",WR,74,216
"Davis, Dominique",QB,75,210
"Davis, Drew",WR,73,205
"DeCoud, Thomas",FS,74,192
"Dent, Akeem",MLB,73,239
"DiMarco, Patrick",FB,73,234
"Douglas, Harry",WR,72,183
"Ewing, Bradie",FB,71,243
"Franks, Dominique",DB,72,197
"Gaither, Omar",MLB,74,235

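To get exactly the same block from `iloc[]` and `loc[]`, the label slice has to name the last row and column wanted, because its end points are included. A small sketch under that assumption, using a hypothetical three-row `by_name` table:

```python
import pandas as pd

# Hypothetical miniature table indexed by player name
by_name = pd.DataFrame(
    {
        'number': [15, 4, 19],
        'position': ['WR', 'QB', 'WR'],
        'height_in_inches': [74, 75, 73],
        'weight_in_lbs': [216, 210, 205],
    },
    index=['Cone, Kevin', 'Davis, Dominique', 'Davis, Drew'],
)

# Positions: rows 0-1 and columns 1-2 (end positions are excluded)
by_position = by_name.iloc[0:2, 1:3]

# Labels: end labels are included, so they name the last row/column wanted
by_label = by_name.loc['Cone, Kevin':'Davis, Dominique',
                       'position':'height_in_inches']

print(by_position.equals(by_label))
```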

## Section 3: Data filtering (boolean selection) in DataFrames

#### Find the players that are more than 80 inches tall.

In [19]:
# Step1: Create a boolean filter
bool_filter = df['height_in_inches'] > 80


# Show bool_filter
bool_filter

0       False
1       False
2       False
3       False
4       False
        ...  
1869    False
1870    False
1871    False
1872    False
1873    False
Name: height_in_inches, Length: 1874, dtype: bool

In [20]:
# Step2: Place the boolean filter in subscript to select 
#        only the data with boolean being True

df[bool_filter]


Unnamed: 0,number,full_name,position,height_in_inches,weight_in_lbs,date_of_birth,team
1351,77,"Dunlap, King",T,81,330,9/14/1985,SD
1593,69,"Dotson, Demar",T,81,315,10/11/1985,TB


#### Exercise: Find the players that are less than 170 lbs in weight

In [21]:
bool_filter = df['weight_in_lbs'] < 170



In [22]:

df[bool_filter]


Unnamed: 0,number,full_name,position,height_in_inches,weight_in_lbs,date_of_birth,team
98,37,"Robey, Nickell",DB,67,165,1/17/1992,BUF


#### Exercise: Find the players from your favorite team!

In [23]:
bool_filter = df['team'] == 'BAL'



In [24]:
df[bool_filter]



Unnamed: 0,number,full_name,position,height_in_inches,weight_in_lbs,date_of_birth,team
292,59,"Brown, Arthur",ILB,72,235,6/17/1990,BAL
293,23,"Brown, Chykie",DB,71,190,12/26/1986,BAL
294,14,"Brown, Marlon",WR,76,205,4/22/1991,BAL
295,49,"Bryant, D.J.",LB,75,248,3/3/1989,BAL
296,56,"Bynes, Josh",ILB,73,240,8/24/1989,BAL
297,99,"Canty, Chris",DE,79,317,11/10/1982,BAL
298,87,"Clark, Dallas",TE,75,252,6/12/1979,BAL
299,62,"Cody, Terrence",NT,76,340,6/28/1988,BAL
300,46,"Cox, Morgan",LS,76,241,4/26/1986,BAL
301,84,"Dickson, Ed",TE,76,255,7/25/1987,BAL


#### Exercise: Pinpoint your favorite player - `Brady, Tom`!

In [25]:
# Old data alert! This dataset lists Brady with NE;
# he later played for the Tampa Bay Buccaneers (TB).

bool_filter = df['full_name'] == 'Brady, Tom'
df[bool_filter]

Unnamed: 0,number,full_name,position,height_in_inches,weight_in_lbs,date_of_birth,team
995,12,"Brady, Tom",QB,76,225,8/3/1977,NE


#### How to apply multiple conditions for the data filtering?

For example, find all players on the Eagles (`PHI`) who are taller than 78 inches.

In [26]:
bool_filter1 = df['height_in_inches'] > 78


bool_filter2 = df['team'] == 'PHI'



In [27]:
# `bool_filter1 and bool_filter2` raises a ValueError (the truth value of a
# Series is ambiguous); use the element-wise operator & instead

filter_final = bool_filter1 & bool_filter2


In [28]:
df[filter_final]


Unnamed: 0,number,full_name,position,height_in_inches,weight_in_lbs,date_of_birth,team
1245,90,"Geathers, Clifton",DE,80,340,12/11/1987,PHI
1257,67,"Kelly, Dennis",T,80,321,1/16/1990,PHI
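
The same pattern extends to other element-wise operators. The sketch below uses a hypothetical four-row `players` table (rows copied from outputs above) to show `&`, `|`, and `~`, why each condition needs parentheses when written inline, and that Python's `and` fails on a Series:

```python
import pandas as pd

# Hypothetical miniature table for illustration
players = pd.DataFrame({
    'full_name': ['Geathers, Clifton', 'Kelly, Dennis', 'Dunlap, King', 'Robey, Nickell'],
    'height_in_inches': [80, 80, 81, 67],
    'team': ['PHI', 'PHI', 'SD', 'BUF'],
})

# Parentheses are required: & binds more tightly than > and ==
tall_eagles = players[(players['height_in_inches'] > 78) & (players['team'] == 'PHI')]

# | is element-wise OR, ~ is element-wise NOT
tall_or_bills = players[(players['height_in_inches'] > 80) | (players['team'] == 'BUF')]
not_eagles = players[~(players['team'] == 'PHI')]

# `and` asks for one True/False from a whole Series, which pandas refuses
try:
    (players['height_in_inches'] > 78) and (players['team'] == 'PHI')
    and_failed = False
except ValueError:
    and_failed = True

print(tall_eagles['full_name'].tolist())
print(and_failed)
```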


## Section 4: Handle Missing Data

The dataset for this section is hosted at the following URL (run the cell below):

In [29]:
url = 'https://raw.githubusercontent.com/BlueJayADAL/DS200/master/datasets/weather_data.csv'

Read the data into a DataFrame named `weather`.

In [30]:
weather = pd.read_csv(url)



In [31]:
weather

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32.0,6.0,Rain
1,1/4/2017,,9.0,Sunny
2,1/5/2017,28.0,,Snow
3,1/6/2017,,7.0,
4,1/7/2017,32.0,,Rain
5,1/8/2017,,,Sunny
6,1/9/2017,,,
7,1/10/2017,34.0,8.0,Cloudy
8,1/11/2017,40.0,12.0,Sunny


#### Show the missing data with the `info()` method

In [32]:
weather.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   day          9 non-null      object 
 1   temperature  5 non-null      float64
 2   windspeed    5 non-null      float64
 3   event        7 non-null      object 
dtypes: float64(2), object(2)
memory usage: 416.0+ bytes


#### Show how many values are missing from each column

In [33]:
weather.isnull().sum()



day            0
temperature    4
windspeed      4
event          2
dtype: int64
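
The count works because `isnull()` produces a table of booleans and `sum()` treats each `True` as 1. A small sketch on a hypothetical `weather_small` table (a hand-made cut of the data above) shows the per-column count and, with `axis=1`, the per-row count:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table with a few missing values
weather_small = pd.DataFrame({
    'day': ['1/1/2017', '1/4/2017', '1/5/2017', '1/6/2017'],
    'temperature': [32.0, np.nan, 28.0, np.nan],
    'event': ['Rain', 'Sunny', 'Snow', None],
})

missing_mask = weather_small.isnull()         # True where a value is missing
missing_per_column = missing_mask.sum()       # True counts as 1 when summed
missing_per_row = missing_mask.sum(axis=1)    # same count, but across each row

print(missing_per_column.to_dict())
print(missing_per_row.tolist())
```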

#### Drop any rows with at least 1 missing value

In [34]:
weather.dropna()



Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32.0,6.0,Rain
7,1/10/2017,34.0,8.0,Cloudy
8,1/11/2017,40.0,12.0,Sunny


*Note: `dropna()` returns a new DataFrame; the change does NOT happen in place unless you specify `inplace=True`*

In [35]:
# weather still has 9 entries since last line 
# of code didn't actually change the data

weather

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32.0,6.0,Rain
1,1/4/2017,,9.0,Sunny
2,1/5/2017,28.0,,Snow
3,1/6/2017,,7.0,
4,1/7/2017,32.0,,Rain
5,1/8/2017,,,Sunny
6,1/9/2017,,,
7,1/10/2017,34.0,8.0,Cloudy
8,1/11/2017,40.0,12.0,Sunny
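
A common way to keep the result is to assign it to a name. The sketch below, on a hypothetical three-row `weather_small` table, shows that the original is untouched and that the surviving rows keep their original index labels:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table with one missing temperature
weather_small = pd.DataFrame({
    'day': ['1/1/2017', '1/4/2017', '1/5/2017'],
    'temperature': [32.0, np.nan, 28.0],
})

# dropna() builds a new DataFrame; weather_small still has all three rows
cleaned = weather_small.dropna()

print(len(weather_small), len(cleaned))
print(cleaned.index.tolist())
```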


#### Drop any columns with at least 1 missing value

In [36]:
weather.dropna(axis = 1)



Unnamed: 0,day
0,1/1/2017
1,1/4/2017
2,1/5/2017
3,1/6/2017
4,1/7/2017
5,1/8/2017
6,1/9/2017
7,1/10/2017
8,1/11/2017


#### Set a dropping threshold. Keep only the rows that have at least the threshold number of non-missing values.

In [39]:
weather.dropna(thresh = 3)



Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32.0,6.0,Rain
1,1/4/2017,,9.0,Sunny
2,1/5/2017,28.0,,Snow
4,1/7/2017,32.0,,Rain
7,1/10/2017,34.0,8.0,Cloudy
8,1/11/2017,40.0,12.0,Sunny
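
To make the threshold concrete, the sketch below counts the non-missing values in each row of a hypothetical `weather_small` table and checks that `dropna(thresh=3)` keeps exactly the rows whose count is 3 or more:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table: rows with 3, 2 and 1 non-missing values
weather_small = pd.DataFrame({
    'day': ['1/5/2017', '1/6/2017', '1/9/2017'],
    'temperature': [28.0, np.nan, np.nan],
    'windspeed': [np.nan, 7.0, np.nan],
    'event': ['Snow', None, None],
})

non_missing = weather_small.notnull().sum(axis=1)   # non-missing values per row
kept = weather_small.dropna(thresh=3)               # keep rows with >= 3 of them
manual = weather_small[non_missing >= 3]            # same selection, written by hand

print(non_missing.tolist())
print(kept.index.tolist())
```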


#### Fill all NaNs with one specific value, e.g. 0.

In [40]:
weather.fillna(value = 0)




Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32.0,6.0,Rain
1,1/4/2017,0.0,9.0,Sunny
2,1/5/2017,28.0,0.0,Snow
3,1/6/2017,0.0,7.0,0
4,1/7/2017,32.0,0.0,Rain
5,1/8/2017,0.0,0.0,Sunny
6,1/9/2017,0.0,0.0,0
7,1/10/2017,34.0,8.0,Cloudy
8,1/11/2017,40.0,12.0,Sunny


#### Fill the NaN with 0 just for columns `temperature` and `windspeed`

In [41]:
weather[['temperature', 'windspeed']].fillna(value = 0)



Unnamed: 0,temperature,windspeed
0,32.0,6.0
1,0.0,9.0
2,28.0,0.0
3,0.0,7.0
4,32.0,0.0
5,0.0,0.0
6,0.0,0.0
7,34.0,8.0
8,40.0,12.0
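
The cell above returns only the two selected columns. If the goal is to keep every column and fill just some of them, one option is to pass `fillna()` a dictionary that maps column names to fill values. A minimal sketch on a hypothetical `weather_small` table:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table with missing values in every column
weather_small = pd.DataFrame({
    'temperature': [32.0, np.nan],
    'windspeed': [np.nan, 9.0],
    'event': ['Rain', None],
})

# Only the columns named in the dict are filled; 'event' keeps its missing value
filled = weather_small.fillna({'temperature': 0, 'windspeed': 0})

print(filled['temperature'].tolist(), filled['windspeed'].tolist())
print(int(filled['event'].isnull().sum()))
```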


#### However, sometimes 0 is not the best guess. Let's use average temperature and median windspeed to fill the NaNs.

In [42]:
avg_temp = weather['temperature'].mean()

median_wind = weather['windspeed'].median()


print(avg_temp, median_wind)

33.2 8.0


In [50]:
weather['temperature'] = weather['temperature'].fillna(value = avg_temp)

weather['windspeed'] = weather['windspeed'].fillna(value = median_wind)


In [51]:
weather

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32.0,6.0,Rain
1,1/4/2017,33.2,9.0,Sunny
2,1/5/2017,28.0,8.0,Snow
3,1/6/2017,33.2,7.0,
4,1/7/2017,32.0,8.0,Rain
5,1/8/2017,33.2,8.0,Sunny
6,1/9/2017,33.2,8.0,
7,1/10/2017,34.0,8.0,Cloudy
8,1/11/2017,40.0,12.0,Sunny
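
This works because `mean()` and `median()` skip missing values by default. A short sketch on a hypothetical `temps` Series makes the arithmetic visible:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing reading
temps = pd.Series([30.0, np.nan, 34.0])

avg = temps.mean()                        # NaN is skipped: (30 + 34) / 2
avg_with_nan = temps.mean(skipna=False)   # NaN, because the gap is not skipped
filled = temps.fillna(avg)                # the gap takes the average

print(avg)
print(filled.tolist())
```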


#### Fill NaN values with forward fill (carry the last valid value forward)

In [52]:
# Fill each NaN with the last valid value above it (forward fill).
new_df = weather.ffill()

new_df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32.0,6.0,Rain
1,1/4/2017,33.2,9.0,Sunny
2,1/5/2017,28.0,8.0,Snow
3,1/6/2017,33.2,7.0,Snow
4,1/7/2017,32.0,8.0,Rain
5,1/8/2017,33.2,8.0,Sunny
6,1/9/2017,33.2,8.0,Sunny
7,1/10/2017,34.0,8.0,Cloudy
8,1/11/2017,40.0,12.0,Sunny


#### Fill NaN values with backward fill (carry the next valid value backward)

In [54]:
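# Fill each NaN with the next valid value below it (backward fill).
# bfill() is used because backfill() is deprecated in recent pandas versions.
new_df = weather.bfill()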

new_df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32.0,6.0,Rain
1,1/4/2017,33.2,9.0,Sunny
2,1/5/2017,28.0,8.0,Snow
3,1/6/2017,33.2,7.0,Rain
4,1/7/2017,32.0,8.0,Rain
5,1/8/2017,33.2,8.0,Sunny
6,1/9/2017,33.2,8.0,Cloudy
7,1/10/2017,34.0,8.0,Cloudy
8,1/11/2017,40.0,12.0,Sunny
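
One edge case is easy to miss: forward fill cannot fill a gap at the very start, and backward fill cannot fill a gap at the very end, because there is no neighbouring value to copy. A minimal sketch on a hypothetical `readings` Series:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with gaps at the start, middle and end
readings = pd.Series([np.nan, 28.0, np.nan, 32.0, np.nan])

forward = readings.ffill()    # copies the previous valid value downward
backward = readings.bfill()   # copies the next valid value upward

print(forward.tolist())
print(backward.tolist())
```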
