# Pandas

The pandas module is one of the most powerful tools for data analysis.  Pandas was designed to work with tabular and heterogeneous data.  The original author of pandas is Wes McKinney, so it makes sense that most of his book "Python for Data Analysis" covers the functionality of pandas. In fact, chapters 5 - 11 are basically about what pandas can do.  

Here are some of the things that I hope you can do by the end of the section:
* Create Series and DataFrames (ch 5)
* Index, slice, and filter (ch 5)
* Examine your data (ch 5)
* Compute summarization and descriptive statistics (ch 5)
* Drop rows and columns (ch 5)
* Create columns (ch 5)
* Count the number of missing values (ch 7)
* Drop or fill missing values (ch 7)
* Drop duplicate rows (ch 7)
* Combine categories of categorical data (ch 7)
* Discretize numerical data (ch 7)
* Have some practice with hierarchical indexing (ch 8)
* Reset the index (ch 8)
* Merge and concatenate DataFrames (ch 8)
* Simple plots with pandas (ch 9)
* Use .groupby() for category aggregation (ch 10)
* Fill missing values by group summary statistics (ch 10)

## Importing Pandas

It is standard to use the alias ``pd`` when importing pandas.
~~~
import pandas as pd
~~~
I usually import numpy at the same time since pandas and numpy are often used in tandem.

In [None]:
# Import Pandas library
import pandas as pd
import numpy as np

In [None]:
# Note: you can install pandas within the notebook:
# !pip install pandas
# OR
# !conda install pandas

In [None]:
# Try:  Create a Series from a list
x = [1,2,3,4,5]
lab = ['a','b','c','d','e']

s = pd.Series(x, index=lab)
print(s)

In [None]:
# Creating a Series with a dictionary

d = pd.Series({'a': 1, 'b': 2, 'c': 3})
print(d)

## DataFrames
DataFrames are the main data structure of pandas and were directly inspired by the R programming language.  DataFrames are a bunch of Series objects put together to share the same (row) index.  A DataFrame has both a row and a column index.  
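To make the "bunch of Series sharing an index" idea concrete, here is a small sketch. The names `height`, `weight`, and `people` and all of the values are made up for illustration. Notice that the rows are aligned on the index labels, so a label missing from one Series shows up as `NaN`.

```python
import pandas as pd

# Two Series with partially overlapping labels (made-up values)
height = pd.Series([1.6, 1.8, 1.7], index=['ann', 'bob', 'cat'])
weight = pd.Series([70, 65], index=['bob', 'cat'])

# Each Series becomes a column; rows are aligned on the shared index
people = pd.DataFrame({'height': height, 'weight': weight})

print(people)
print(people.index.tolist())    # the row index
print(people.columns.tolist())  # the column index
```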

## Creating DataFrames
DataFrames can also be created from lists, dictionaries, or numpy arrays.  
Syntax: `pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)`


In [None]:
x = [[1, 2, 3],
     ['a', 'b', 'c'],
     [4, 5, 6]]

x_df = pd.DataFrame(x, columns = ['p', 'd', 'q'], index = ['x', 'y', 'z'])
print(x_df)

In [None]:
# Create a simple DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'Salary': [50000, 60000, 75000]}

df = pd.DataFrame(data)

# Display the DataFrame
df

In [None]:
# Accessing specific columns
names = df['Name']
ages = df['Age']

# Accessing a specific row
row = df.loc[1]

# Accessing a specific element
salary = df.at[2, 'Salary']


In [None]:
# Display the results
print("Names: \n", names)


In [None]:
print("Ages: \n", ages)

In [None]:
print("Row 1: \n", row)

In [None]:
print("Charlie's Salary:", salary)

In [None]:
# Add a new column calculated from existing columns
df['Birth Year'] = 2023 - df['Age']

# Display the DataFrame with the new column
df.head()


In [None]:
# Sort the DataFrame by Birth Year in ascending order (oldest person first)
df_sorted = df.sort_values(by='Birth Year', ascending=True)

# Display the sorted DataFrame
df_sorted.head()


## Read in some practice data

`pd.read_csv` can be used to load external .csv files.  
We can access a summary of the data using `df.info()`.  
We can use `df.head()` to view the first few entries.  
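If you want to try these three calls without downloading anything, one option is to read CSV text from memory. This is only a sketch: the `csv_text` string and the `scores` name are made up, and `io.StringIO` simply stands in for a file on disk.

```python
import io
import pandas as pd

# A tiny CSV held in a string (made-up data), standing in for a real file
csv_text = "name,score\nann,90\nbob,85\ncat,78\n"

scores = pd.read_csv(io.StringIO(csv_text))

scores.info()          # column names, dtypes, and non-null counts
print(scores.head(2))  # first two rows
```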

In [None]:
## Iris data
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
iris = pd.read_csv(url, names=['sepal_length','sepal_width', 'petal_length', 'petal_width', 'class'])

## Looking at your DataFrame

``df.head()``  
``df.tail()``  
``df.shape``  
``df.info()``  
``df.describe()``   
``df.columns``

In [None]:
iris.shape

In [None]:
iris.columns

In [None]:
iris.head()

In [None]:
iris.info()

In [None]:
iris.describe()

In [None]:
# Rename columns
iris.rename(columns={'class': 'species'}, inplace=True)

# Display the DataFrame with renamed columns
iris.head()


## Basic Plotting
Pandas can be used for basic plotting; we will cover plotting in more detail later.

In [None]:
iris['sepal_length'].plot.hist(bins=20)

In [None]:
iris.plot.scatter('sepal_length','sepal_width', c='petal_width')

In [None]:
iris.plot.box()

In [None]:
iris.plot.kde()

---

## Selection and Indexing

There are various ways to get subsets of the data.  In the following, ``df`` refers to a DataFrame.

#### Selecting columns
One column (producing a Series)
~~~
df['column_name']
df.column_name
~~~
---

Multiple columns (producing a DataFrame)
~~~
df[['column_name']] # this will produce a DataFrame
df[['col1', 'col2', 'col3']]
~~~
---

#### Selecting rows and columns with ``loc`` and ``iloc``
~~~
df.loc['row_name', 'col_name']
df.iloc[row_index, col_index]  # integer positions, not labels
~~~

``loc`` and ``iloc`` also support slicing.  Note: when slicing with ``loc``, the end point IS included (but not when slicing with ``iloc``).

---
~~~
df.loc['row_name1':'row_name2', 'col_name1':'col_name2']
df.loc[:, 'col_name1':'col_name2']
df.loc['r1':'r2', :]
df.loc[['r1','r2','r3'],['c1','c2']]
~~~
*When using `.loc[]`, `row_name2` and `col_name2` WILL be included*

---
~~~
df.iloc[index1:index2, col1:col2]
~~~
*When using `.iloc[]`, `index2` and `col2` will NOT be included*
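Here is a quick side-by-side check of the end-point rule. The `demo` DataFrame and its labels are made up for illustration.

```python
import pandas as pd

demo = pd.DataFrame({'c1': [10, 20, 30, 40], 'c2': [1, 2, 3, 4]},
                    index=['r1', 'r2', 'r3', 'r4'])

# Label-based slice: the end label 'r3' is included
by_label = demo.loc['r1':'r3', 'c1']

# Position-based slice: position 2 is excluded, like ordinary Python slicing
by_position = demo.iloc[0:2, 0]

print(by_label.tolist())
print(by_position.tolist())
```

The same rule applies when the index is the default 0, 1, 2, ...: with such an index, `df.loc[0:5]` returns six rows while `df.iloc[0:5]` returns five.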

---
#### Selecting rows based on column condition
~~~
df[df['column_name'] > value]  # any boolean condition on a column

df[mask]
~~~


In [None]:
iris.loc[0:5, ['petal_width', 'petal_length']]

In [None]:
iris.iloc[0:2, 0:]

In [None]:
iris['sepal_length'] > 6

In [None]:
## Slicing with a boolean series
iris[iris['sepal_length'] > 6]

In [None]:
# Filter data using multiple conditions (Note the parentheses!)
filtered_iris = iris[(iris['sepal_length'] > 6) & (iris['petal_length'] > 5)]

# Display the filtered data
filtered_iris.head()


In [None]:
# Reset to default 0,1...n index
filtered_iris.reset_index(drop = True).head()

## Multi-Index and Index Hierarchy

Let us go over how to work with a Multi-Index. First we'll create a quick example of what a Multi-Indexed DataFrame looks like:

In [None]:
# Index Levels
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)

In [None]:
hier_index

In [None]:
df = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['A','B'])
df

Now let's show how to index this! For an index hierarchy on the rows we use `df.loc[]`; if the hierarchy were on the columns axis, you would just use normal bracket notation `df[]`. Calling one level of the index returns the sub-DataFrame:

In [None]:
df.loc['G1']

In [None]:
df.loc['G1'].loc[1]

In [None]:
df.index.names

In [None]:
df.index.names = ['Group','Num']

In [None]:
df

In [None]:
# The xs() method in pandas is used to extract a cross-section from a DataFrame or Series
df.xs('G1')
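The cells above only select on the outer level. As a small sketch of reaching the inner level too, the block below builds its own toy frame (the name `toy` and its values are made up, and `pd.MultiIndex.from_product` is used here just as a shortcut for building the same kind of index).

```python
import numpy as np
import pandas as pd

# Same shape of index as above: two groups, three numbered rows each
idx = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]],
                                 names=['Group', 'Num'])
toy = pd.DataFrame({'A': np.arange(6), 'B': np.arange(6) * 10}, index=idx)

one_group = toy.loc['G1']          # every row in group G1
one_row = toy.loc[('G1', 2)]       # a single row, returned as a Series
num_one = toy.xs(1, level='Num')   # the Num == 1 row from every group

print(one_group)
print(one_row)
print(num_one)
```

Passing `level=` to `xs()` is what lets you pick from the inner level without first selecting an outer group.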

## Methods for computing summary and descriptive statistics
pandas objects have many reduction / summary statistics methods that extract a single value from the rows or columns of a DataFrame.  See Table 5-8 in *Python for Data Analysis* for a more complete list, but here are a few that are commonly used.

`count`: number of non-NA values   
`describe`: summary statistics for numerical columns   
`min`, `max`: min and max values  
`argmin`, `argmax`: integer position of min and max values (for Series only)   
`idxmin`, `idxmax`: index or column name of min and max values  
`sum`: sum of values  
`cumsum`: cumulative sum  
`mean`: mean of values  
`quantile`: quantile from 0 to 1 of values  
`var`: (sample) variance of values  
`std`: (sample) standard deviation of values  

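Most of these functions also take an `axis` argument which specifies whether to reduce over rows or columns: `axis=0` (the default) reduces over the rows and returns one value per column, and `axis=1` reduces over the columns and returns one value per row.   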
There is also an argument `skipna` which specifies whether or not to skip missing values.  The default is True.
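A small sketch of what `axis` and `skipna` do in practice. The `scores` DataFrame is made up for illustration.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({'a': [1.0, 2.0, np.nan],
                       'b': [4.0, 5.0, 6.0]})

col_means = scores.mean()                # axis=0: one value per column
row_sums = scores.sum(axis=1)            # axis=1: one value per row
strict_mean = scores.mean(skipna=False)  # a missing value makes the result NaN

print(col_means)
print(row_sums)
print(strict_mean)
```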


In [None]:
iris.sepal_length.argmin()

In [None]:
iris.cumsum()

## Unique values and value counts

``df.nunique()`` or ``df['column'].nunique()``  

``df.value_counts()`` or ``df['column'].value_counts()``
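Here is a quick sketch of the difference between the two on a single column. The `colors` Series is made up for illustration.

```python
import pandas as pd

colors = pd.Series(['red', 'blue', 'red', 'green', 'red', 'blue'])

n_distinct = colors.nunique()                 # how many different values
counts = colors.value_counts()                # rows per value, largest first
shares = colors.value_counts(normalize=True)  # proportions instead of counts

print(n_distinct)
print(counts)
print(shares)
```

On a whole DataFrame, `df.nunique()` works column by column, while `df.value_counts()` counts unique rows (combinations of values across the columns).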

In [None]:
iris.nunique()

In [None]:
iris.species.unique()

`df.corr()` and `df.cov()` will produce the correlation or covariance matrix.  Two Series can also be used to get the correlation (or covariance) with `Series1.corr(Series2)`.

Numpy functions can also be used: `np.corrcoef()`

In [None]:
iris.corr(numeric_only = True)

---
## Dropping rows and columns

Columns and rows can be dropped with the `.drop()` method (using `axis=1` for columns and `axis=0` (default) for rows).  This method creates a new object unless `inplace=True` is specified.

The `del` command can also be used to drop columns in place.
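A short sketch of the three options side by side. The `toy` DataFrame is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

fewer_rows = toy.drop([0, 2])         # drop rows by index label (axis=0)
fewer_cols = toy.drop(columns=['b'])  # same result as toy.drop('b', axis=1)

# del removes the column from toy itself; nothing is returned
del toy['c']

print(fewer_rows)
print(fewer_cols)
print(toy)
```

Note that `fewer_cols` still has column `c`, because it was created before the `del` and `.drop()` returned a new object.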

In [None]:
no_species = iris.drop('species', axis = 1)
no_species.head()

In [None]:
# The original is unchanged if inplace = False
iris.head()

## Adding columns

Add a new column to the end of a data frame
~~~
df['new_col'] = value
~~~

Add a new column at a specific index

`.insert(col_index, 'new_col_name', value(s))`
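A minimal sketch of `.insert()`, using a made-up `toy` DataFrame. Unlike most DataFrame methods, `.insert()` changes the DataFrame in place and returns `None`.

```python
import pandas as pd

toy = pd.DataFrame({'a': [1, 2], 'c': [5, 6]})

# Put the new column 'b' at position 1, between 'a' and 'c'
toy.insert(1, 'b', [3, 4])

print(toy)
```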

In [None]:
iris['sum_petal_dims'] = iris['petal_length'] + iris['petal_width']

In [None]:
iris.head()

## Using Apply

In [None]:
iris['sepal_length'].apply(np.log)

In [None]:
# What happened?
iris['sepal_length'].apply(np.mean)

In [None]:
iris.iloc[:, 0:4].apply(np.mean)

In [None]:
iris['species'].apply(lambda x: x.title())

In [None]:
iris['species'].str.lower()

In [None]:
def zero_one_scale(x):
    return (x - np.min(x)) / (np.max(x)- np.min(x))

In [None]:
## Why does this not work?
iris['sepal_length'].apply(zero_one_scale)

In [None]:
## Why does this not work?
iris['petal_length'].apply(lambda x: zero_one_scale(x))

In [None]:
zero_one_scale(iris.petal_length)
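To spell out what is going on in the cells above: `Series.apply` hands the function one value at a time, so inside `zero_one_scale` the min and the max are both just that single value and we end up computing 0/0. The sketch below repeats the function on a made-up `lengths` Series so it can run on its own.

```python
import numpy as np
import pandas as pd

def zero_one_scale(x):
    return (x - np.min(x)) / (np.max(x) - np.min(x))

lengths = pd.Series([1.0, 2.0, 5.0])

# Element by element: min == max == x, so every result is 0/0 -> NaN
with np.errstate(invalid='ignore'):  # silence the divide warning
    elementwise = lengths.apply(zero_one_scale)

# Whole Series at once: min and max see every value
scaled = zero_one_scale(lengths)

print(elementwise)
print(scaled)
```

So `.apply` on a Series is the right tool for functions of a single value (like `np.log`), and calling the function directly on the Series is the right tool when the function needs to see the whole column.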

## Missing Values

**Ways to count missing values**
~~~
df.info()
df.isna().sum()
df.isna().sum(axis=0)
~~~

**Drop missing values with `.dropna()`**

Calling `.dropna()` without any arguments will drop all rows with missing values

Arguments:
* `axis=1` will drop columns with missing values (default is `axis=0`)
* `how='all'` will drop rows (or columns) if all the values are NA (default is `how='any'`)
* `subset=` will limit the NA search to these specific columns (or indexes)
    

**Fill missing values with `.fillna()`**

Arguments:
* `value`: value used to fill.
* `method`: method used to fill (forward or backward fill)
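Here is a sketch of how those arguments change the result, using a made-up `gaps` DataFrame. For forward and backward fill I use the `.ffill()` and `.bfill()` methods; as far as I know, recent pandas releases deprecate the `method` argument of `.fillna()` in favor of these.

```python
import numpy as np
import pandas as pd

gaps = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                     'B': [np.nan, np.nan, 6.0],
                     'C': [7.0, 8.0, 9.0]})

any_na = gaps.dropna()                # drop rows with any NA
all_na = gaps.dropna(how='all')       # drop rows only if every value is NA
subset_a = gaps.dropna(subset=['A'])  # only look for NA in column A
no_na_cols = gaps.dropna(axis=1)      # drop columns with any NA

forward = gaps.ffill()   # carry the last seen value forward
backward = gaps.bfill()  # pull the next value backward

print(any_na)
print(forward)
print(backward)
```

Forward fill cannot fill a missing value at the very top of a column (there is nothing before it), so `forward['B']` still starts with `NaN`.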


In [None]:
# Creating a DataFrame with missing values
missing_data = {'A': [1, 2, np.nan],
        'B': [np.nan, 4, 6],
        'C': [7, 8, 9]}

m_df = pd.DataFrame(missing_data)
m_df

In [None]:
m_df.info()

In [None]:
# Check for missing values
m_df.isna()

In [None]:
# Check how many missing values in each column
m_df.isna().sum()

In [None]:
# Check how many missing values in each row
m_df.isna().sum(axis = 1)

In [None]:
# Fill missing values
df_filled = m_df.fillna(-1)
df_filled

In [None]:
m_df

In [None]:
# Fill with mean column value
m_df.fillna(m_df.mean())

In [None]:
# Remove rows with missing values
df_dropped = m_df.dropna()

# Display the cleaned DataFrame
df_dropped

In [None]:
df.head()

## Groupby, Aggregation

### Use Titanic data example here

In [103]:
## Titanic data
# from sklearn.datasets import fetch_openml
# dat = fetch_openml(data_id=40945, parser = 'auto')
# titanic = dat.frame

titanic = pd.read_csv('https://raw.githubusercontent.com/rhodes-byu/cs180-winter25/refs/heads/main/data/titanic.csv')

In [None]:
titanic.head()

In [None]:
titanic.drop('name', axis = 1, inplace = True)

In [None]:
# Average age by sex
age_by_sex = titanic.groupby('sex')['age'].mean()

# Display the aggregated data
print("Age By Sex:\n", age_by_sex)


In [None]:
# Multiple Grouping Categories
titanic.groupby(['sex', 'pclass'])['age'].mean()

In [None]:
# Multiple Target Variables
titanic.groupby(['sex'])[['age', 'fare']].mean()

In [None]:
# Multiple Aggregations
titanic.groupby('sex')['age'].agg(['mean', 'max', 'min', 'sum']).round()


In [None]:
# Define a custom aggregation function
# (named value_range so it does not shadow Python's built-in range)
def value_range(series):
    return series.max() - series.min()

titanic.groupby('sex')['fare'].agg(value_range)


In [None]:
# Group data by 'home.dest' and apply named aggregations to multiple columns
region_summary = titanic.groupby('home.dest').agg(
    total_fare=('fare', 'sum'),
    average_fare=('fare', 'mean'),
    average_age=('age', 'mean')
)

# Display the summary for each region
print("Region-wise Summary:\n", region_summary)
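One of the goals listed at the top was filling missing values with a group summary statistic. A possible sketch uses `.transform()`, which returns a value for every original row instead of one row per group. The `riders` DataFrame below is made up for illustration (it is not the Titanic data), and `age_filled` is just a name I chose.

```python
import numpy as np
import pandas as pd

riders = pd.DataFrame({'sex': ['female', 'female', 'male', 'male', 'male'],
                       'age': [30.0, np.nan, 40.0, 22.0, np.nan]})

# Mean age of each row's own group, aligned with the original rows
group_mean = riders.groupby('sex')['age'].transform('mean')

# Fill each missing age with the mean of that row's group
riders['age_filled'] = riders['age'].fillna(group_mean)

print(riders)
```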


### Combining DataFrames

In [None]:
import pandas as pd

# Create two DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2']})

df2 = pd.DataFrame({'A': ['A3', 'A4', 'A5'],
                    'B': ['B3', 'B4', 'B5']})




In [None]:
df1

In [None]:
df2

In [None]:
# Concatenate DataFrames vertically
result = pd.concat([df1, df2], axis=0)

# Display the concatenated DataFrame
print("Concatenated DataFrame:\n", result)
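Notice that a vertical concatenation keeps each DataFrame's original row labels, so labels can repeat. If that is not what you want, `ignore_index=True` renumbers the rows. A small sketch with made-up `top` and `bottom` DataFrames:

```python
import pandas as pd

top = pd.DataFrame({'A': ['A0', 'A1']})
bottom = pd.DataFrame({'A': ['A2', 'A3']})

stacked = pd.concat([top, bottom])                       # keeps labels 0, 1, 0, 1
renumbered = pd.concat([top, bottom], ignore_index=True) # new labels 0, 1, 2, 3

print(stacked)
print(renumbered)
```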

In [None]:
# Create two DataFrames with a common column 'key'
left = pd.DataFrame({'key': ['A', 'B', 'C'],
                     'value_left': [1, 2, 3]})

right = pd.DataFrame({'key': ['B', 'C', 'D'],
                      'value_right': [4, 5, 6]})


In [None]:
left

In [None]:
right

In [None]:
# Merge DataFrames based on the 'key' column
merged_inner = pd.merge(left, right, on='key', how='inner')

# Display the merged DataFrame
print("Inner Merge:\n", merged_inner)

In [None]:
merged_outer = pd.merge(left, right, on='key', how='outer')
print("Outer Merge:\n", merged_outer)

In [None]:
# Create a DataFrame with wide-format data
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Math_Score': [90, 85, 78],
        'Science_Score': [88, 92, 80]}

df = pd.DataFrame(data)

print(df)

### Reshaping DataFrames

In [None]:
# Melt the DataFrame to long-format
melted_df = pd.melt(df, id_vars=['Name'], var_name='Subject', value_name='Score')

# Display the melted DataFrame
print("Melted DataFrame:\n", melted_df)

In [None]:

# Define a mapping function to assign letter grades
def assign_grade(score):
    if score >= 90:
        return 'A'
    elif score >= 80:
        return 'B'
    elif score >= 70:
        return 'C'
    else:
        return 'F'

# Apply the mapping function to create a new column 'Grade'
melted_df['Grade'] = melted_df['Score'].map(assign_grade)

# Display the DataFrame with letter grades
print(melted_df)


In [None]:
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Scott', 'Liz'],
        'Age': [28, 45, 60, 34, 50, 40]}

df = pd.DataFrame(data)

# Define bin edges and labels for age groups
bin_edges = [0, 30, 40, 50, 100]
bin_labels = ['0-30', '31-40', '41-50', '51+']

# Use the `cut` function to create a new column 'AgeGroup'
df['AgeGroup'] = pd.cut(df['Age'], bins=bin_edges, labels=bin_labels)

# Display the DataFrame with age groups
print(df)


### Reading in Data

In [None]:
import numpy as np
import json
import os

In [None]:
# Make the data directory if it doesn't exist
os.makedirs('data', exist_ok=True)

In [None]:
# Write DataFrame to a CSV file
df.to_csv('data/data.csv', index=False)

# Read data from a CSV file
new_df = pd.read_csv('data/data.csv')

# Display the DataFrame
print(new_df)


### Read in CSV

In [None]:
# Downloading data to data/ directory (May not work on Windows)
!curl -L -o data/example_csv.csv https://raw.githubusercontent.com/rhodes-byu/stat386-datasets/refs/heads/main/reading_examples/example_csv.csv

In [None]:
df = pd.read_csv('data/example_csv.csv', index_col = 0, thousands = ',')

In [None]:
df.head()

## read_excel

In [None]:
!curl -L -o data/example_excel.xlsx https://raw.githubusercontent.com/rhodes-byu/stat386-datasets/refs/heads/main/reading_examples/example_excel.xlsx

In [None]:
df = pd.read_excel('data/example_excel.xlsx', sheet_name=None)

In [None]:
df

In [None]:
df.keys()

In [None]:
lines = df['lines']

In [None]:
lines.head()

In [None]:
df = pd.read_excel('data/example_excel.xlsx', sheet_name = 'lines')

In [None]:
df

## json files

In [None]:
!curl -L -o data/example_json.json https://raw.githubusercontent.com/rhodes-byu/stat386-datasets/refs/heads/main/reading_examples/example_json.json

In [None]:
pd.read_json('data/example_json.json')

In [None]:
with open('data/example_json.json', 'r') as file:
    json_object = json.load(file)

print(json_object)

In [None]:
json_object['cap']

In [None]:
pd.DataFrame(json_object)

In [None]:
pd.json_normalize(json_object)

## read_html

In [None]:
url = 'https://en.wikipedia.org/wiki/List_of_Super_Bowl_champions'

In [None]:
# Reads in a list of DataFrames from the URL (based on tables)
dfs = pd.read_html(url)

In [None]:
len(dfs)

In [None]:
dfs[9].head()

In [None]:
pd.read_html(url, match = 'Joe Robbie')[0]

## Getting multiple pieces of information from a single column

### Unpacking

Many times, a single column will contain multiple pieces of information.  Learning how to extract this information is extremely important and is a great skill to have.

If it is possible to somehow separate or split the elements in the column, this is a much easier and more effective way of extracting information than slicing by position.

For example, suppose we have a list of cities with the state.  We want to separate the city and the state into individual columns.  

In [None]:
cities = pd.Series(['Provo, Utah', 'Omaha, Nebraska', 'Fremont, Ohio','Green River, Wyoming', 'Durham, North Carolina' ])
cities

In [None]:
for name in cities:
    print(name)

This looks like a hard problem because there are different lengths for each city and state name.  Some of the city names and state names even have spaces.  We recognize that there is a common format.  The city names are all separated from the state name by a comma.  We can use the string method ``.split("character")`` to separate the words in a string based on ``"character"``.  

By default, ``.split()`` will separate on spaces.

In [None]:
s = 'Provo, Utah'
s.split()

In [None]:
s.split(",")

In [None]:
cities.apply(lambda x: x.split(","))

Or we could use  ``.str`` with ``.split``

In [None]:
cities.str.split(",")

Now we have a Series of lists.  Next we need to get the information out.  We know that each entry in our Series had only one comma, so when we split on the comma (using ``.split(",")``) everything before the comma is the first item in the list and everything after the comma is the second item in the list.  
In our example, the first item is the city name and the second item is the state name.

Here are a couple of ways to extract the data that was split.

**First using a ``for`` loop:**

Notice that the state variable has white space, so we can strip that inside our for loop:

In [None]:
# for loop
cities_split = cities.str.split(",")

state = []
city = []
for item in cities_split:
    city.append(item[0].strip())
    state.append(item[1].strip())

In [None]:
cities_split

In [None]:
state

In [None]:
city

**Second using ``.apply`` and ``lambda`` functions:**

In [None]:
# apply with lambda function
city = cities_split.apply(lambda x:x[0].strip())
state = cities_split.apply(lambda x:x[1].strip())

In [None]:
state

In [None]:
city
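**A third option, using ``expand=True``:**

As a sketch of one more approach, `.str.split()` can return the pieces as separate columns of a DataFrame when `expand=True` is passed. The `parts` name and the column names below are my own choices, and the block uses a shortened copy of the cities so it can run on its own.

```python
import pandas as pd

cities = pd.Series(['Provo, Utah', 'Green River, Wyoming',
                    'Durham, North Carolina'])

# expand=True gives one column per piece instead of a Series of lists
parts = cities.str.split(',', expand=True)
parts.columns = ['city', 'state']

# The state still has the leading space that followed the comma
parts['state'] = parts['state'].str.strip()

print(parts)
```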

**Another Example**

Here, suppose I have times in the format ``hour:minute:second``.  I want to make a variable that combines these into just one number.  Since the smallest unit is seconds, I will make a variable for "seconds".

In [None]:
times = pd.Series(['01:34:07','00:35:12','00:00:16','03:59:00'])

In [None]:
time_list = times.str.split(":")
time_list

In [None]:
seconds = []
for time in time_list:
    temp = int(time[0])*60*60 + int(time[1])*60 + int(time[2])
    seconds.append(temp)


In [None]:
seconds
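For this particular format there is also a built-in shortcut. This is a sketch of an alternative, not a replacement for the split approach above: `pd.to_timedelta` can parse ``hour:minute:second`` strings, and `.dt.total_seconds()` converts the result to seconds.

```python
import pandas as pd

times = pd.Series(['01:34:07', '00:35:12', '00:00:16', '03:59:00'])

# Parse the strings as durations, then express each one in seconds
seconds = pd.to_timedelta(times).dt.total_seconds().astype(int)

print(seconds.tolist())
```

The split approach is still worth knowing, since it works for any delimited text and not only for times.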