## Python Workshop: Part 1

Welcome! Here you will find a crash course in Python aimed at researchers and anyone else who works with data.

This is a Jupyter Notebook - it allows you to run Python code, write text and tables in Markdown, and view images and other outputs. This is especially useful for exploratory data analysis, developing machine learning models, and any other interactive task where you want to run code in sections or iterate on your work without running a whole program.

Some helpful keyboard shortcuts for working in Jupyter notebooks: https://digitalhumanities.hkust.edu.hk/tutorials/jupyter-notebook-tips-and-shortcuts/

We will focus on common data manipulation tasks in pandas. The emphasis is on practical solutions rather than low-level details, so be sure to check out the documentation if you want to learn more about the underlying functionality: https://pandas.pydata.org/docs/index.html

In [None]:
import pandas as pd  # data frames, tabular data read/write
import numpy as np  # linear algebra, math functions
import seaborn as sns  # plotting

# we are only importing the datetime class from the datetime module
from datetime import datetime

### Types and Variables (this is a markdown cell)

In [None]:
# this is a comment (this is a code cell)

"""
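This is a multi-line string. Python has no dedicated multi-line
comment syntax, but an unassigned string like this one is often used as one.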
"""

# types of variables

# integer
a = 3
# float (64-bit double precision)
b = 2.55
# string
d = "boat"
# boolean (True, False) ~ (1, 0)
f = True
g = False
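
To check what type a variable holds, you can use the built-in `type()` function. The short sketch below reuses the same example values as the cell above (the names `total` and `mixed` are just for illustration) and shows that booleans really do behave like 1 and 0 in arithmetic.

```python
# same example values as in the cell above
a = 3        # integer
b = 2.55     # float
d = "boat"   # string
f = True     # boolean

# type() returns the class of each value
print(type(a), type(b), type(d), type(f))

# booleans behave like the integers 1 and 0 in arithmetic
total = f + f  # True + True
print(total)

# mixing an integer and a float in arithmetic produces a float
mixed = a + b
print(type(mixed))
```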

### Intro to DataFrames

In [None]:
# basics
important_data = {
    "name": ["dom", "brian", "mia", "han"],
    "car": ["charger", "skyline", "integra", "RX-7"],
    "how_fast": [7, 8, 5, 9],
    "how_furious": [9, 3, 4, 2],
}

# this is a data frame, you can fill it with any data that is table-like
important_df = pd.DataFrame(important_data)

important_df.head()

In [None]:
# data frames are objects with many METHODS, these are functions we access like: df.method()

# describe returns summary statistics for numeric types
important_df.describe()

In [None]:
# there is a built-in plotting method...
important_df.plot(
    x="name", kind="bar", title="The disposition of various family members"
)

In [None]:
# data frames also have many ATTRIBUTES, these are data, not functions, so no parentheses: df.attribute
important_df.dtypes

In [None]:
important_df.shape

In [None]:
# columns can be accessed with square brackets or attribute notation
print(important_df["name"])
print(important_df.name)
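
The two notations are not always interchangeable. Here is a small sketch using a made-up data frame (`cars` and its "top speed" column are hypothetical, not part of the workshop data): attribute notation only works when the column name is a valid Python name that does not clash with an existing data frame method or attribute, while square brackets always work.

```python
import pandas as pd

# hypothetical frame; one column name contains a space
cars = pd.DataFrame({"name": ["dom", "brian"], "top speed": [7, 8]})

# both notations return the same Series for a simple column name
by_bracket = cars["name"]
by_attribute = cars.name
print(by_bracket.equals(by_attribute))

# only square brackets work for a name with a space in it
speeds = cars["top speed"]
print(speeds.tolist())

# passing a list of names returns a DataFrame instead of a Series
subset = cars[["name", "top speed"]]
print(type(subset))
```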

### Reading data from a file

In [None]:
"""
DATA.gov provides open government data from federal, state, county, and city agencies

Let's take a look at a dataset from the Consumer Financial Protection Bureau

https://catalog.data.gov/dataset/college-credit-card-marketing-agreements-data

As required by the Credit CARD Act of 2009, we collect information annually from credit card issuers who
have marketing agreements with universities, colleges, or affiliated organizations such as alumni associations,
sororities, fraternities, and foundations.

Data dictionary: https://files.consumerfinance.gov/f/documents/cfpb_college-credit-card-data-guide_2022.pdf

"""

# you can use a URL to a file or a file path on your computer
file_location = "https://files.consumerfinance.gov/f/documents/cfpb_college-credit-card-agreements-database-2009-2019.csv"

# read_csv is for reading csv files, there's also read_excel, etc... with many adjustable parameters to match your file format
df = pd.read_csv(file_location)

In [None]:
# head (tail) shows the first (last) 5 lines of the dataframe (unless you specify a number: e.g. df.tail(8))
df.head()

In [None]:
# Info shows you the types and non-missing counts of data in each column, plus memory usage
df.info()

In [None]:
# what's up with the "object" columns??? Mixed types, usually!
df[
    "PAYMENTS BY ISSUER"
].value_counts()  # This is WACK and a bad practice (mixing types, that is). Let's fix it!

# we can fix it by reassigning the column using to_numeric (errors="coerce" turns non-numbers into NaN)
# df["PAYMENTS BY ISSUER"] = pd.to_numeric(df["PAYMENTS BY ISSUER"], errors="coerce")
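
To see what `errors="coerce"` does without touching the real data, here is a minimal sketch on a made-up Series (the values in `raw` are invented for illustration). Anything that cannot be parsed as a number becomes NaN, and the result is a float column.

```python
import pandas as pd

# hypothetical values mimicking a numeric column polluted with text
raw = pd.Series(["1000", "250.5", "N/A", "not reported", "0"], dtype=object)

# values that cannot be parsed as numbers become NaN
numeric = pd.to_numeric(raw, errors="coerce")
print(numeric.dtype)
print(numeric.isna().sum())

# the original values that failed to convert
non_numeric = raw[numeric.isna()].tolist()
print(non_numeric)
```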

In [None]:
# here is a way to get the unique non-numeric values...
"""
Breaking this down:
df['PAYMENTS BY ISSUER']: our column
pd.to_numeric(df['PAYMENTS BY ISSUER'], errors='coerce').isnull(): tries to convert each value to a number; values that fail become NaN, and isnull() marks those rows True
df['PAYMENTS BY ISSUER'][...].unique(): uses that True/False result to keep only those rows of our original column, then grabs the unique values
"""

# let's add this to our read_csv instead (see top) along with specific dtypes
na_values = df["PAYMENTS BY ISSUER"][
    pd.to_numeric(df["PAYMENTS BY ISSUER"], errors="coerce").isnull()
].unique()

# we can add a list of na_values as well as a dict of dtypes
df = pd.read_csv(
    file_location, na_values=na_values, dtype={"PAYMENTS BY ISSUER": "float64"}
)
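
The same two-pass pattern can be tried on a tiny in-memory file. This is only a sketch: the CSV text, the column names, and the placeholder strings "not reported" and "withheld" are all made up, and `io.StringIO` stands in for the real file location.

```python
import io

import pandas as pd

# hypothetical CSV text standing in for the real file
csv_text = (
    "issuer,payments\n"
    "Bank A,1000\n"
    "Bank B,not reported\n"
    "Bank C,250.5\n"
    "Bank D,withheld\n"
)

# first pass: the payments column is read as text because of the stray strings
first_pass = pd.read_csv(io.StringIO(csv_text))

# collect the unique values that fail to convert to a number
not_numeric = pd.to_numeric(first_pass["payments"], errors="coerce").isnull()
bad_values = first_pass.loc[not_numeric, "payments"].dropna().unique().tolist()
print(bad_values)

# second pass: treat those strings as missing and force a float column
clean = pd.read_csv(
    io.StringIO(csv_text), na_values=bad_values, dtype={"payments": "float64"}
)
print(clean.dtypes)
```

Note that `na_values` given as a list applies to every column in the file, so it is worth checking that none of those strings are legitimate values elsewhere.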

In [None]:
# looks as expected now!
df.info()

### Cleaning up column names and values

In [None]:
# let's clean up these column names
col_rename = {
    "REPORTING YEAR": "year",
    "INSTITUTION OR ORGANIZATION": "institution",
    "TYPE OF INSTITUTION OR ORGANIZATION": "institution_type",
    "CREDIT CARD ISSUER": "issuer",
    "CITY": "city",
    "STATE": "state",
    "STATUS": "status",
    "IN EFFECT AS OF BEGINNING OF NEXT YEAR": "in_effect_ny",
    "TOTAL OPEN ACCOUNTS AS OF END OF REPORTING YEAR": "total_open_acct_eoy",
    "PAYMENTS BY ISSUER": "payments_by_issuer",
    "NEW ACCOUNTS OPENED IN REPORTING YEAR": "new_acct_ry",
}

# rename columns like so; inplace=True modifies your existing df, while inplace=False (the default) returns a new df
df.rename(columns=col_rename, inplace=True)

# note: rename won't complain if you misspell a column name, so it's always good to check the result
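
If you would rather have pandas complain about a misspelled column name, `rename` accepts `errors="raise"`. A minimal sketch on a made-up frame (`demo` and the deliberate typo "STAET" are for illustration only):

```python
import pandas as pd

demo = pd.DataFrame({"CITY": ["Austin"], "STATE": ["TX"]})

# "STAET" is a deliberate typo: by default rename silently ignores it
renamed = demo.rename(columns={"CITY": "city", "STAET": "state"})
print(list(renamed.columns))

# errors="raise" turns the same typo into a KeyError
try:
    demo.rename(columns={"CITY": "city", "STAET": "state"}, errors="raise")
    typo_caught = False
except KeyError:
    typo_caught = True
print(typo_caught)
```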

In [None]:
# much better!
df.head(10)

### GroupBy

In [None]:
df.groupby(["state"]).agg({"new_acct_ry": "sum"})

You usually end up having to fix data issues as you go, so make sure you go back and update prior code as needed.

In [None]:
# let's fix Texas and Utah and Massachusetts using the replace method and a dict {"original": "new"}
df.state = df.state.replace({"Texas": "TX", "Utah": "UT", "Ma": "MA"})
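
One way to spot values like these in the first place is to look for anything that is not a two-letter uppercase code. The sketch below uses a made-up Series (`states` is hypothetical) and a regular expression; it is one possible check, not the only one.

```python
import pandas as pd

# hypothetical state column with a few inconsistent entries
states = pd.Series(["TX", "Texas", "UT", "Utah", "Ma", "MA"])

# flag anything that is not exactly two uppercase letters
is_code = states.str.fullmatch(r"[A-Z]{2}")
suspicious = states[~is_code].unique().tolist()
print(suspicious)

# fix them with a dict of {"original": "new"}
fixed = states.replace({"Texas": "TX", "Utah": "UT", "Ma": "MA"})
print(fixed.unique().tolist())
```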

#### Exercise: Top 10 states by new accounts opened

In [None]:
# let's sort the results this time
df.groupby(["state"]).agg({"new_acct_ry": "sum"}).sort_values(
    by="new_acct_ry", ascending=False
).head(10)

In [None]:
# let's make a chart of the top 10 states
state_new_accounts = (
    df.groupby(["state"])
    .agg({"new_acct_ry": "sum"})
    .sort_values(by="new_acct_ry", ascending=False)
    .reset_index()
    .head(10)
)

# let's use seaborn barplot (you need to provide a df, and names of x and y axes)
sns.barplot(data=state_new_accounts, x="state", y="new_acct_ry")

#### Exercise: Make a plot of total_open_acct_eoy by year

In [None]:
# let's make a chart of total open accounts by year
# groupby already returns the years in ascending order, so no sorting or head() is needed
year_open_accounts = (
    df.groupby(["year"])
    .agg({"total_open_acct_eoy": "sum"})
    .reset_index()
)

# let's use seaborn pointplot (you need to provide a df, and names of x and y axes)
sns.pointplot(data=year_open_accounts, x="year", y="total_open_acct_eoy")

In [None]:
# groupby with multiple aggregation functions and sorting multi-index
df.groupby(["state"]).agg({"new_acct_ry": ["sum", "count"]}).sort_values(
    [("new_acct_ry", "sum")], ascending=False
).head(10)
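
If the multi-index columns feel awkward, named aggregation is an alternative that gives each result a flat column name. A minimal sketch on made-up numbers (the frame `accounts` and the output names `total_new` and `n_rows` are assumptions for illustration):

```python
import pandas as pd

# hypothetical data with the same column names as the workshop df
accounts = pd.DataFrame(
    {
        "state": ["TX", "TX", "NY", "NY", "NY"],
        "new_acct_ry": [10, 5, 7, 1, 2],
    }
)

# named aggregation: output_name=(input_column, function)
summary = (
    accounts.groupby("state")
    .agg(
        total_new=("new_acct_ry", "sum"),
        n_rows=("new_acct_ry", "count"),
    )
    .sort_values("total_new", ascending=False)
)
print(summary)
```

Sorting then only needs the plain column name instead of a tuple.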

### Querying and indexing

There are many ways to filter and select data in pandas. query is best when you want to filter with an expression; indexing is best for selecting specific columns and rows.

In [None]:
# we can use string methods to filter text columns (startswith, contains, endswith)
df.query("institution.str.contains('City College')")

In [None]:
# conditions can be combined with and / or
df.query("institution.str.contains('Rutgers') and year > 2016")

In [None]:
# ordinary comparisons on numeric and text columns work too (==, >, <, ...)
df.query("year == 2015 and city == 'New York'")

In [None]:
# you can get the same result with indexing and booleans but I think this is more annoying to type and read
df[(df["year"] == 2015) & (df["city"] == "New York")]
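
To convince yourself that the two approaches are equivalent, here is a small sketch on a made-up frame (`demo` and `target_year` are hypothetical). It also shows that query can refer to a Python variable by prefixing its name with `@`.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "year": [2015, 2015, 2016],
        "city": ["New York", "Boston", "New York"],
    }
)

target_year = 2015

# query: the @ prefix refers to a Python variable
via_query = demo.query("year == @target_year and city == 'New York'")

# boolean indexing: each condition needs its own parentheses, joined with &
via_mask = demo[(demo["year"] == target_year) & (demo["city"] == "New York")]

print(via_query.equals(via_mask))
```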

In [None]:
# iloc is your friend if you want to select data by position i.e. column or row number
# here we are selecting ranges of columns and rows, i.e. df.iloc[rows,columns]
df.iloc[2:8, 3:8]
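
The label-based counterpart of iloc is loc. The sketch below, on a made-up frame with a non-default index (`demo` is hypothetical), selects the same block both ways. Note the difference in slicing: iloc excludes the end position, while loc includes the end label.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "name": ["dom", "brian", "mia", "han"],
        "car": ["charger", "skyline", "integra", "RX-7"],
        "how_fast": [7, 8, 5, 9],
    },
    index=[10, 20, 30, 40],
)

# by position: rows 0 and 1, columns 0 and 1 (end position excluded)
by_position = demo.iloc[0:2, 0:2]

# by label: index labels 10 through 20, columns "name" through "car" (end label included)
by_label = demo.loc[10:20, "name":"car"]

print(by_position.equals(by_label))
```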

### Method chaining

It's easy to make mistakes in notebooks by running cells out of order. One way I like to keep my code clean is by using method chaining in a single cell instead of spreading code out over several cells. This allows us to combine multiple dataframe methods in a pipeline of sorts.

In [None]:
# example: top ten cities in Texas according to sum of "payments by issuer"
df.query("state == 'TX'").groupby("city").agg(
    {"payments_by_issuer": "sum"}
).sort_values("payments_by_issuer", ascending=False).iloc[:10]