Pandas Example

In [None]:
'''
Resources:
10 minute intro to Pandas: https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
Cheatsheet with most common commands: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
'''

In [None]:
import requests
import lxml.html as lh
import pandas as pd

Thousands of Python 'libraries' are available for people to use: packages of Python code that other people have written and made public. Each package (like pandas, numpy, scipy) is built for a particular purpose.

Requests - A library to easily send HTTP requests through Python
lxml.html - lxml's HTML parser: code for searching through the HTML (HyperText Markup Language) that makes up a web page
Pandas - Open source data analysis and manipulation tool, most well-known for use of dataframes (tables)

In [None]:
url = 'https://www.pro-football-reference.com/years/2021/passing.htm'
df = pd.read_html(url)[0]

In [None]:
df

What is this?

In [None]:
print(type(df))

What can we do with it? Think of it like a table.

We can take rows

In [None]:
first_row = df.iloc[0,:]

print(first_row)

In [None]:
# to make viewing rows and columns a little easier
pd.set_option('display.max_columns', None)

df.iloc[0:1,:]

What is a "row" in Pandas?

In [None]:
print(type(first_row))

It's a Pandas Series object! It's basically an array (a list) with indices (labels for each position in the list)

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series
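To make "an array with labels" concrete, here is a minimal sketch that builds a Series by hand. The values are made up for illustration and are not taken from the real table.

```python
import pandas as pd

# Hypothetical values: a tiny stand-in for one row of the table
row = pd.Series(["Example Player", 25, 101.3], index=["Player", "Age", "Rate"])

labels = row.index.tolist()   # the labels attached to each position
age = row["Age"]              # look up a value by its label

print(labels)
print(age)
```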

We can select elements of a Pandas Series by their label. In the above example, if we are interested in the value stored under the "Player" label, we can do:

In [None]:
print(first_row.loc['Player'])

These are the two ways we select things in a Pandas object: by label (loc), or by integer position (iloc)
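A small sketch of the difference, using a made-up table whose row labels are letters so that labels and positions can't be confused. The names `toy`, `by_label` and `by_position` are just for this example.

```python
import pandas as pd

# Hypothetical data with non-numeric row labels
toy = pd.DataFrame(
    {"Player": ["A", "B", "C"], "Att": [30, 12, 45]},
    index=["x", "y", "z"],
)

by_label = toy.loc["y", "Att"]   # row labelled "y", column labelled "Att"
by_position = toy.iloc[1, 1]     # second row, second column (counting from 0)

print(by_label, by_position)
```

Both lookups land on the same cell; they just describe its location differently.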

Just like we can select rows, we can also select columns

In [None]:
df.head()

In [None]:
first_column = df.loc[:,"Player"]

print(first_column)

Conceptual question: What type is "first_column"? I.e. what would be the output of type(first_column)?

There is another very important concept: boolean indexing. Let's look at a small example

In [None]:
small_df = df.head()

small_df

What if I wanted all the elements except the first one?

In [None]:
small_df.loc[[False, True, True, True, True], :]

Why can we do this!? When we 'select' rows using the .loc command, this is really what Pandas sees: for every row, it has to decide whether to keep that row (True) or not (False).

As one might imagine, writing out the entire list of boolean values by hand would be horribly tedious. Usually there are quicker ways to build that list

In [None]:
small_df.loc[small_df.index != 0, :]

What's going on here!? Well first what is small_df.index?

In [None]:
print(type(small_df.index))

In [None]:
print(small_df.index)

For all intents and purposes, it's basically a list. But it's a list! And zero is a number! With a plain Python list, 'list != number' would just give a single True. Why do we get one answer per element here!?

This is (basically) called broadcasting - Pandas assumes that what you really meant is "compare each element of list with 'number'", e.g. 'list != [0,0,0,0,0]'

In [None]:
print(small_df.index != 0)
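Here is a minimal sketch contrasting a plain Python list with a Pandas index, using made-up numbers. The variable names are only for illustration.

```python
import numpy as np
import pandas as pd

plain_list = [0, 1, 2, 3, 4]
idx = pd.Index(plain_list)

# A plain list is compared as a whole: one answer for the entire list
list_result = plain_list != 0

# A Pandas index is compared element by element: one answer per position
index_result = idx != 0

# Writing out the "broadcast" version by hand gives the same mask
explicit = idx != np.array([0, 0, 0, 0, 0])

print(list_result)
print(index_result)
print(explicit)
```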

Let's sort this by 'Rate' to see who are some of the best quarterbacks

In [None]:
#pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

sorted_df = df.sort_values('Rate', ascending=False)

sorted_df

Why did we get a weird output? Why is Josh Johnson the highest?

In [None]:
print(df['Rate'].dtype)

Pandas is not reading 'Rate' as a number; the column's dtype is 'object', which usually means the values are stored as Python strings. Strings sort character by character rather than by numeric value, so if we want to sort properly (i.e. with the largest numbers at the top), we have to convert the column to a numeric data type.
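A quick sketch of why text sorts "wrong", assuming a column whose numbers were read in as strings. The three values are invented for the example.

```python
import pandas as pd

# Hypothetical ratings stored as text instead of numbers
as_text = pd.Series(["9.5", "100.2", "85.0"])

# Sorting text compares character by character, so "9..." beats "1..."
sorted_text = as_text.sort_values(ascending=False).tolist()

# Converting first gives the ordering we actually want
sorted_numbers = pd.to_numeric(as_text).sort_values(ascending=False).tolist()

print(sorted_text)
print(sorted_numbers)
```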

In [None]:
# this will change the one column to be a numeric type
# pd.to_numeric returns a new Series, so we have to assign it back to the column
df['Rate'] = pd.to_numeric(df['Rate'], errors='coerce')

In [None]:
print(df['Rate'].dtype)
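To see what errors='coerce' does, here is a small sketch. The stray 'Rate' string in the middle is an assumption standing in for any non-numeric entry a scraped column might contain.

```python
import pandas as pd

# Hypothetical column: two numbers stored as text plus one non-numeric entry
raw = pd.Series(["101.3", "Rate", "88.7"])

# errors='coerce' turns anything that can't be parsed into NaN
# instead of raising an error
cleaned = pd.to_numeric(raw, errors="coerce")

print(cleaned)
print(cleaned.dtype)
```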

In [None]:
# This is how we would do it if we wanted to change multiple columns to numeric type.
# errors='coerce' turns any value that can't be parsed into NaN instead of raising an error.
# The interesting part is the DataFrame's 'apply' method, which calls a function on each column.

columns_to_make_numeric = ['G', 'GS', 'Cmp', 'Att', 'Y/G', 'Rate', 'QBR', 'Sk']

df[columns_to_make_numeric] = df[columns_to_make_numeric].apply(pd.to_numeric, errors='coerce')

In [None]:
print(df['G'].dtype)
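One way to think about 'apply' here: it does the same job as looping over the columns yourself. A minimal sketch on a made-up table (the column names mirror the real ones, the values do not):

```python
import pandas as pd

# Hypothetical data where the numbers were read in as text
toy = pd.DataFrame({"Player": ["A", "B"], "G": ["16", "17"], "Att": ["500", "n/a"]})
cols = ["G", "Att"]

# Version 1: convert each column in an explicit loop
looped = toy.copy()
for col in cols:
    looped[col] = pd.to_numeric(looped[col], errors="coerce")

# Version 2: let apply call pd.to_numeric once per column
applied = toy.copy()
applied[cols] = applied[cols].apply(pd.to_numeric, errors="coerce")

print(applied.dtypes)
print(applied.equals(looped))
```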

In [None]:
correctly_sorted_df = df.sort_values('Rate', ascending=False)

correctly_sorted_df

Great! Now we have a sorted dataframe. But we probably don't care about quarterbacks who've made so few pass attempts, right? Let's get rid of anyone with fewer than 20

In [None]:
filtered_df = correctly_sorted_df.loc[correctly_sorted_df['Att'] >= 20, :]

filtered_df
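This filter is boolean indexing again: the comparison produces one True/False per row, and .loc keeps the True rows. A small sketch with invented numbers:

```python
import pandas as pd

# Hypothetical attempt counts
toy = pd.DataFrame({"Player": ["A", "B", "C"], "Att": [30, 12, 45]})

# The comparison is applied to every row, giving a boolean Series
mask = toy["Att"] >= 20

# .loc keeps only the rows where the mask is True
filtered = toy.loc[mask, :]

print(mask.tolist())
print(filtered)
```

Note that the surviving rows keep their original row labels, so the index of the filtered table can have gaps.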