# **MSAS Tutorial 2**
## **Authored by Benjamin (Ben) Riela**

Today we will be working with DataFrames, the main structure used to process and manipulate data in Python analytics.

The first step is to import the pandas library, which provides the DataFrame. This library is essential to most Python analytics and to this module series.

"Import" simply means bringing the pandas library into our code. We give it the alias 'pd', which is the industry naming convention.

In [1]:
import pandas as pd

Open the link in the code cell below: what do you see?
What do you think CSV means, and why would it be used to store data?

In [2]:
# Read the CSV file at this URL into a DataFrame
df = pd.read_csv("https://raw.githubusercontent.com/rielaben/MSAS_Modules_2021_2022/main/Week%202/nfl_passing_2020.csv")

print(df)

      Rk            Player   Tm  Age  Pos  ...   Y/C    Y/G   Rate   QBR  Sk
0      1   Deshaun Watson*  HOU   25   QB  ...  12.6  301.4  112.4  70.5  49
1      2  Patrick Mahomes*  KAN   25   QB  ...  12.2  316.0  108.2  82.9  22
2      3         Tom Brady  TAM   43   QB  ...  11.6  289.6  102.2  72.5  21
3      4         Matt Ryan  ATL   35   QB  ...  11.3  286.3   93.3  67.0  41
4      5       Josh Allen*  BUF   24   QB  ...  11.5  284.0  107.2  81.7  26
..   ...               ...  ...  ...  ...  ...   ...    ...    ...   ...  ..
107  108        Brett Kern  TEN   34  NaN  ...   NaN    0.0   39.6   0.0   0
108  109        D.J. Moore  CAR   23   WR  ...   NaN    0.0   39.6   NaN   0
109  110       Zach Pascal  IND   26   WR  ...   NaN    0.0   39.6   2.5   0
110  111     Sammy Watkins  KAN   27   WR  ...   NaN    0.0    0.0   0.0   0
111  112     Isaiah Wright  WAS   23   wr  ...   NaN    0.0   39.6   3.8   0

[112 rows x 20 columns]
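To make the CSV idea concrete, here is a minimal sketch that parses a small piece of CSV text the same way read_csv parses the file at the URL. The text, player names, and numbers are made up for illustration and are not taken from the NFL file.

```python
import io

import pandas as pd

# Hypothetical CSV text: the first line holds the column names, and each
# later line is one row whose values are separated by commas.
csv_text = """Player,Tm,Yds
Player A,AAA,4000
Player B,BBB,3500
"""

# read_csv accepts a URL, a file path, or a file-like object such as StringIO.
toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df)
print(toy_df.shape)
```

Because a CSV is plain text, it can be opened in a browser, a text editor, or a spreadsheet program, which is one reason it is a common way to share data.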


"display" shows output as a nicely formatted table, so we use it more often than "print".

In [3]:
display(df.head(3))

Unnamed: 0,Rk,Player,Tm,Age,Pos,G,GS,Cmp,Att,Cmp%,Yds,TD,Int,1D,Y/A,Y/C,Y/G,Rate,QBR,Sk
0,1,Deshaun Watson*,HOU,25,QB,16,16,382,544,70.2,4823,33,7,221,8.9,12.6,301.4,112.4,70.5,49
1,2,Patrick Mahomes*,KAN,25,QB,15,15,390,588,66.3,4740,38,6,238,8.1,12.2,316.0,108.2,82.9,22
2,3,Tom Brady,TAM,43,QB,16,16,401,610,65.7,4633,40,12,233,7.6,11.6,289.6,102.2,72.5,21


Now we can start to dig into the actual data.
What do you think the code below does?

In [4]:
df.shape

(112, 20)
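As a small sketch of how to read that result, the example below builds a made-up table (hypothetical names and numbers) and unpacks its shape. The first number is the row count and the second is the column count.

```python
import pandas as pd

# Hypothetical toy table with 3 rows and 2 columns.
toy_df = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "Yds": [4000, 3500, 120],
})

# shape is a tuple: (number of rows, number of columns).
n_rows, n_cols = toy_df.shape
print(n_rows, n_cols)

# len() on a DataFrame gives the number of rows only.
print(len(toy_df))
```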

This outputs the names of all the columns in the dataframe:

In [5]:
df.columns

Index(['Rk', 'Player', 'Tm', 'Age', 'Pos', 'G', 'GS', 'Cmp', 'Att', 'Cmp%',
       'Yds', 'TD', 'Int', '1D', 'Y/A', 'Y/C', 'Y/G', 'Rate', 'QBR', 'Sk'],
      dtype='object')

What do you think this means?

In [6]:
df.dtypes

Rk          int64
Player     object
Tm         object
Age         int64
Pos        object
G           int64
GS          int64
Cmp         int64
Att         int64
Cmp%      float64
Yds         int64
TD          int64
Int         int64
1D          int64
Y/A       float64
Y/C       float64
Y/G       float64
Rate      float64
QBR       float64
Sk          int64
dtype: object
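Each column holds one data type: whole numbers (int64), decimals (float64), or text (shown as object in the output above). The sketch below uses a made-up table to show one way this information can be used, picking out only the numeric columns. The table contents are hypothetical.

```python
import pandas as pd

# Hypothetical toy table with one text column and two numeric columns.
toy_df = pd.DataFrame({
    "Player": ["Player A", "Player B"],
    "Age": [25, 43],
    "Rate": [112.4, 102.2],
})

# One data type per column.
print(toy_df.dtypes)

# Keep only the columns whose data type is numeric (integers and decimals).
numeric_cols = toy_df.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```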

Calling .tail(n) shows the last n rows of your dataframe (the last 5 if you leave n out):

What looks weird in the table below?

In [7]:
display(df.tail(20))

Unnamed: 0,Rk,Player,Tm,Age,Pos,G,GS,Cmp,Att,Cmp%,Yds,TD,Int,1D,Y/A,Y/C,Y/G,Rate,QBR,Sk
92,93,Greg Ward,PHI,25,wr,16,10,1,1,100.0,15,0,0,1,15.0,15.0,0.9,118.7,0.9,0
93,94,Kendall Hinton,DEN,23,,1,0,1,9,11.1,13,0,2,1,1.4,13.0,13.0,0.0,0.1,1
94,95,Jaquan Johnson,BUF,25,,14,0,1,1,100.0,13,0,0,1,13.0,13.0,0.9,118.7,,0
95,96,Tommy Townsend,KAN,24,,16,0,1,1,100.0,13,0,0,1,13.0,13.0,0.8,118.7,13.9,0
96,97,Isaiah McKenzie,BUF,25,wr,16,7,1,1,100.0,12,1,0,1,12.0,12.0,0.8,156.2,91.0,0
97,98,Logan Woodside,TEN,25,,6,0,1,3,33.3,7,0,0,1,2.3,7.0,1.2,42.4,71.7,0
98,99,Travis Kelce*+,KAN,31,TE,15,15,1,2,50.0,4,0,0,1,2.0,4.0,0.3,56.2,62.7,0
99,100,Easton Stick,LAC,25,,1,0,1,1,100.0,4,0,0,0,4.0,4.0,4.0,83.3,1.8,0
100,101,Joshua Dobbs,PIT,25,,1,0,4,5,80.0,2,0,0,0,0.4,0.5,2.0,79.2,56.4,0
101,102,Jamal Agnew,DET,25,,14,2,0,1,0.0,0,0,0,0,0.0,,0.0,39.6,1.3,0


* In the following modules, we will talk more about these irregularities in the data and how to clean them before doing analyses.
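As a preview only, here is a rough sketch of how irregularities like these could be spotted. It assumes a made-up position column with a missing value and inconsistent capitalization, and it only detects the problems rather than cleaning them.

```python
import pandas as pd

# Hypothetical toy column: one missing position and two spellings of "WR".
toy_df = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C", "Player D"],
    "Pos": ["QB", "WR", "wr", None],
})

# Count the missing values in the column.
missing_count = int(toy_df["Pos"].isna().sum())
print(missing_count)

# Count how often each distinct value appears; "WR" and "wr" are
# treated as different values because the comparison is case-sensitive.
counts = toy_df["Pos"].value_counts()
print(counts)
```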

In [8]:
display(type(df))
display(type(df['Rate']))

pandas.core.frame.DataFrame

pandas.core.series.Series

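This is how you would isolate a column. Note the double brackets: df[['Player']] returns a one-column dataframe, while df['Player'] with single brackets returns a Series: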

In [9]:
players_column = df[['Player']]
players_column.head()

Unnamed: 0,Player
0,Deshaun Watson*
1,Patrick Mahomes*
2,Tom Brady
3,Matt Ryan
4,Josh Allen*
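The difference between single and double brackets can be sketched on a small made-up table (the names below are hypothetical):

```python
import pandas as pd

# Hypothetical toy table.
toy_df = pd.DataFrame({
    "Player": ["Player A", "Player B"],
    "Tm": ["AAA", "BBB"],
})

# Single brackets with one column name return a Series (a single column).
as_series = toy_df["Player"]

# Double brackets pass a list of column names and return a DataFrame,
# even when the list holds only one name.
as_frame = toy_df[["Player"]]

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(as_frame.shape)
```

The DataFrame form is handy when you plan to add more columns to the list later, since the code stays the same shape.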


This is how to isolate multiple columns:

In [10]:
new_df = df[['Player', 'Tm', 'Pos']]
new_df

Unnamed: 0,Player,Tm,Pos
0,Deshaun Watson*,HOU,QB
1,Patrick Mahomes*,KAN,QB
2,Tom Brady,TAM,QB
3,Matt Ryan,ATL,QB
4,Josh Allen*,BUF,QB
...,...,...,...
107,Brett Kern,TEN,
108,D.J. Moore,CAR,WR
109,Zach Pascal,IND,WR
110,Sammy Watkins,KAN,WR


Practice: create a dataframe showing only each player's name, completions, yards, touchdowns, and sacks.

You can filter rows by a condition by placing that condition inside the '[' and ']' operators:

In [11]:
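# Template: fill in a column name and a value, then uncomment one line
# df[df['column_name'] < value]
# df[df['column_name'] > value]
# df[df['column_name'] == value]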
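The filtering pattern works in two steps: the inner comparison produces a True/False value for every row, and the outer brackets keep only the rows marked True. Here is a minimal sketch on a made-up table (hypothetical names and numbers, not the NFL data):

```python
import pandas as pd

# Hypothetical toy table.
toy_df = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "Pos": ["QB", "WR", "QB"],
    "Yds": [4000, 15, 3500],
})

# Step 1: the comparison gives one True/False value per row.
mask = toy_df["Yds"] > 1000
print(mask)

# Step 2: passing that mask inside the brackets keeps only the True rows.
high_yards = toy_df[mask]
print(high_yards)

# Two conditions can be combined with & (and); each one needs its own parentheses.
qbs_over_3800 = toy_df[(toy_df["Pos"] == "QB") & (toy_df["Yds"] > 3800)]
print(qbs_over_3800)
```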

How would we isolate Tom Brady's stats for last season?

Isolate Taysom Hill's stats - what looks weird?

How would we filter so we are just looking at wide receivers?

https://www.youtube.com/watch?v=WcX12X4Xw0c

Look up the sort_values function and try to apply it to the dataframe:

Who had the most passing yards last season?
Who had the most sacks?
Who had the most interceptions?
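For reference, here is a minimal sketch of sort_values on a small made-up table. The names and numbers are hypothetical; the same pattern should carry over to the NFL dataframe with its own column names.

```python
import pandas as pd

# Hypothetical toy table.
toy_df = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "Yds": [3500, 4000, 120],
})

# Sort by one column; ascending=False puts the largest values first.
by_yards = toy_df.sort_values(by="Yds", ascending=False)
print(by_yards)

# The first row of the sorted table is the leader in that column.
leader = by_yards.iloc[0]["Player"]
print(leader)
```

Note that sort_values returns a new sorted dataframe and leaves the original one unchanged.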


Who had the highest completion percentage? What can we do to get a more accurate result?
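One possible approach, sketched on made-up data, is to require a minimum number of attempts before ranking, so that a player with a single completed pass does not top the list. The threshold of 100 attempts is an arbitrary assumption for illustration, and the names and numbers are hypothetical.

```python
import pandas as pd

# Hypothetical toy table: Player B completed their only pass attempt.
toy_df = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "Att": [500, 1, 450],
    "Cmp%": [68.0, 100.0, 65.5],
})

# Assumed cutoff; pick whatever minimum makes sense for your question.
min_attempts = 100

# Keep only players with enough attempts, then rank by completion percentage.
qualified = toy_df[toy_df["Att"] >= min_attempts]
ranked = qualified.sort_values(by="Cmp%", ascending=False)
print(ranked)
```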

Go mess around with this data - try out new things! Google is your best friend: if you get an error or don't know how to do something, ALWAYS Google it first.