<a href="https://colab.research.google.com/github/zackives/upenn-cis-2450/blob/main/cis2450lab2nb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CIS 2450 Lab 2: Pandas

### September 8, 2024



# Python & Libraries

Before starting with Pandas, let's take a look at the relationship between core Python and third-party libraries.

Many third-party libraries exist that users can `install` (download) and then `import` to perform specific tasks (e.g., Pandas for data analysis, Matplotlib for visualizations).

If you are running Python for the first time on your local machine, you may need to run `pip install` for each third-party library before you can import it.

Google Colab comes with many commonly used packages (such as Pandas, NumPy, and Matplotlib) already `installed`, so we do not need to run `pip install` every time we restart the runtime. However, you still need to `import` them!
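The difference between installing and importing can be sketched as follows. This is only an illustration: it uses the standard-library `importlib.util.find_spec` to check whether a package is available in the current environment, and the list of package names is just an example.

```python
import importlib.util

# find_spec returns None when a top-level package is not installed here
for name in ["pandas", "numpy", "matplotlib"]:
    installed = importlib.util.find_spec(name) is not None
    status = "installed" if installed else f"not installed (try: pip install {name})"
    print(f"{name}: {status}")

# Being installed is not enough: the package must still be imported before use
import pandas as pd
print(pd.__version__)
```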

<img src = 'https://drive.google.com/uc?id=18IEGTtHQM1HUPHiws3wpA8P08zxRT0It'>



# Pandas 🐼

Pandas is an open-source data analysis and manipulation library for Python, widely used in data science, machine learning, and other data-centric tasks. It provides data structures like Series (one-dimensional labeled arrays) and DataFrame (two-dimensional labeled data tables, similar to Excel spreadsheets) that are highly efficient for handling and analyzing structured data.

Some key features of Pandas include:

- Data Manipulation: Easy-to-use tools for cleaning, filtering, sorting, and grouping data.
- Data Wrangling: Support for handling missing data, merging, reshaping, and pivoting datasets.
- Data I/O: Capabilities to read from and write to various file formats such as CSV, Excel, SQL databases, JSON, and more.
- Data Aggregation: Functions for calculating statistics like mean, sum, median, and standard deviation, with options for grouping and applying custom functions.

Pandas simplifies data handling and is highly integrated with other data analysis libraries in Python, making it a cornerstone for data manipulation tasks.

In [None]:
import pandas as pd
import numpy as np

## Series and DataFrames

* **Series:** a one-dimensional labeled array whose index labels must be hashable. It can be built from array-like data such as a list, or from a dict (the dict keys become the index).
* **DataFrame:** two-dimensional, size-mutable tabular data whose columns are each a Series. It can be built from a dict of lists, a two-dimensional array-like object, or another DataFrame.

In [None]:
sports = pd.Series(['football', 'basketball', 'volleyball', 'tennis'])  # from a list

population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3,
                        'United Kingdom': 64.9, 'Netherlands': 16.9}) #dict

countries = pd.DataFrame({'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']})

In [None]:
sports

In [None]:
type(sports)

In [None]:
population

In [None]:
type(population)

In [None]:
population.index

In [None]:
population.values

In [None]:
population / 100

To access a single entry of a Series or a single column of a DataFrame, use the `.` operator or brackets `[ ]`. To access multiple DataFrame columns at once, use `[[ ]]`.

In [None]:
population['Netherlands']

In [None]:
population.Netherlands

In [None]:
countries

In [None]:
countries['population']

In [None]:
countries[['population', 'capital']]

In [None]:
type(countries)

In [None]:
type(countries.area) # single column from df is series

In [None]:
countries['area']

We can also filter dataframes with conditional operators. A comparison on a column returns a boolean Series, which can then be used to select the matching rows:

In [None]:
countries.capital == 'London'

In [None]:
# Extract data for UK/London
countries[countries.capital == 'London']

In [None]:
# We can also do this without the .
countries[countries['capital'] == 'London']

In [None]:
# Getting all countries with area > 100k!

countries[countries['area'] > 100000]
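Several conditions can be combined in one filter. The sketch below rebuilds a reduced copy of `countries` so that it runs on its own; the thresholds are arbitrary and chosen only for illustration.

```python
import pandas as pd

countries = pd.DataFrame({
    'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
    'population': [11.3, 64.3, 81.3, 16.9, 64.9],
    'area': [30510, 671308, 357050, 41526, 244820],
})

# & is element-wise AND; each condition needs its own parentheses
large_and_populous = countries[(countries['area'] > 100000) & (countries['population'] > 70)]

# | is element-wise OR
small_or_sparse = countries[(countries['area'] < 50000) | (countries['population'] < 20)]

# ~ negates a boolean mask
not_large = countries[~(countries['area'] > 100000)]

print(large_and_populous)
```

Python's `and`, `or`, and `not` do not work here because they expect a single truth value rather than a Series of booleans.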

#### EXERCISE: Series and DataFrames

Which of the following results in a series?



In [None]:
# A.
df[df['city'] == 'Boston']

# B.
df[df['country'] == 'Germany']['population'] / 100

# C.
df[['age']].applymap(lambda x: ageGroup(x))

## Creating New Columns
Adding columns to the DataFrame!

In [None]:
# basic assignment
countries['newVar'] = [1,2,3,None,None]
countries

In [None]:
# using existing columns for assignment
# here, we are creating a COMPOSITE VARIABLE defined as:
# "2 * population + sqrt(area)" for each row in countries
countries['newVar2'] = countries.population * 2  + countries.area**0.5
countries
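Bracket assignment modifies the DataFrame in place. As a hedged alternative, `DataFrame.assign` returns a new DataFrame and leaves the original untouched. The sketch below uses a reduced copy of `countries`; the `density` column and the units in the comments are assumptions made for illustration.

```python
import pandas as pd

countries = pd.DataFrame({
    'country': ['Belgium', 'France'],
    'population': [11.3, 64.3],   # assumed to be in millions
    'area': [30510, 671308],      # assumed to be in square kilometers
})

# assign returns a new DataFrame with the extra column
with_density = countries.assign(
    density=countries['population'] * 1e6 / countries['area']
)

print(with_density)
print(countries.columns.tolist())  # the original has no 'density' column
```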

## Apply

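`apply` is a flexible method that runs a function on every row or column of a DataFrame, or on every element of a Series, which covers many data manipulation tasks. It is more concise than an explicit for loop over rows and is often faster, although a built-in vectorized column operation is faster still when one exists for the task.

NOTE: In some assignments, your code may never finish running if you use for loops, due to the size of the datasets!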

Apply to a dataframe

<img src = 'https://drive.google.com/uc?id=15yFJR7MEMLZdl_GGV-gH6NzgumQ7YfWR'>


Apply to a single column

<img src = 'https://drive.google.com/uc?id=1BH1mHUNCscEelzn9dFvv386AXrszxwwD'>


Let's practice with a simple dataframe ```df``` which contains a single ```Age``` column.

In [None]:
df = pd.DataFrame({'Age': [1, 2, 19, 39, 50]})
df

In [None]:
## APPLY to dataframe
# add 10 to all ages less than 50

df['NewAge_simple'] = df.apply(lambda x: x['Age'] + 10 if x['Age'] < 50 else x['Age'], axis=1)

# the code above can also be written as ...

def addTen(num):
    return num + 10
df['NewAge_function'] = df.apply(lambda x: addTen(x['Age']) if x['Age'] < 50 else x['Age'], axis=1)

## APPLY to column
# the code above can also be written as ...
df['NewAge_simple_col'] = df['Age'].apply(lambda x: x + 10 if x < 50 else x)

# the code above can also be written as ...
df['NewAge_function_col'] = df['Age'].apply(lambda x: addTen(x) if x < 50 else x)

df
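For a simple conditional update like the one above, a vectorized expression can replace `apply` entirely. The following is a minimal sketch using `np.where`; the column names `NewAge_apply` and `NewAge_where` are made up for this comparison.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [1, 2, 19, 39, 50]})

# Element-by-element version, as in the cell above
df['NewAge_apply'] = df['Age'].apply(lambda x: x + 10 if x < 50 else x)

# Vectorized version: np.where takes values from the second argument where
# the condition is True and from the third argument where it is False
df['NewAge_where'] = np.where(df['Age'] < 50, df['Age'] + 10, df['Age'])

print(df)
```

Both columns hold the same values; the vectorized form avoids calling a Python function once per row.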

#### EXERCISE: Apply

Which of the following is incorrect?


In [None]:
# A.
df['Adult'] = df.apply(lambda x: True if x['Age'] >= 18 else False, axis=1)

# B.
df['Adult'] = df['Age'].apply(lambda x: True if x >= 18 else False)

# C.
df['Adult'] = df['Age'].apply(lambda x: True if x >= 18 else False, axis=1)

Another exercise

In [None]:
# Let's call ageBucket on every element in Age, and set that as a new column!

def ageBucket(x):
    if x<18:
        return "A. <18"
    elif x<25:
        return "B. 18-25"
    elif x<45:
        return "C. 25-45"
    else:
        return "D. >45"

##### EXERCISE #####


In [None]:
df['AgeBucket2'] = df.apply(lambda x : ageBucket(x['Age']),axis=1)
df.head()

In [None]:
df.applymap(lambda x: str(x) + "--")

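Other related methods that you can look into are `map` and `applymap`.
* `map` works only on a Series and applies a function to each element, much like `apply` on a Series.
* `applymap` works only on DataFrames and applies a function to every element of the DataFrame. In pandas 2.1 and later it is deprecated in favor of `DataFrame.map`, which does the same thing.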
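A short sketch of both methods follows. The `Height` column and the lookup dict are hypothetical values added for illustration, and the `hasattr` check is one possible way to support both older and newer pandas versions.

```python
import pandas as pd

df = pd.DataFrame({'Age': [1, 19, 50], 'Height': [80, 175, 168]})

# Series.map accepts a function, like apply on a Series ...
doubled = df['Age'].map(lambda x: x * 2)

# ... and also a dict, used as a lookup table (unmatched values become NaN)
labels = df['Age'].map({1: 'infant', 19: 'adult'})

# Element-wise function over a whole DataFrame: DataFrame.map is the newer
# name (pandas 2.1+), so fall back to applymap on older versions
elementwise = df.map if hasattr(df, 'map') else df.applymap
as_text = elementwise(lambda x: str(x) + '--')

print(doubled, labels, as_text, sep='\n\n')
```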

## Groupby Operations

##### Some 'theory': the groupby operation (split-apply-combine)

The "group by" concept: we want to **apply the same function to subsets of a dataframe, where a key determines how the dataframe is split into subsets**

This operation is also referred to as the "split-apply-combine" operation, involving the following steps:

* **Splitting** the data into groups based on some criteria
* **Applying** a function to each group independently
* **Combining** the results into a data structure

<img src="https://github.com/CIS-519/primer-dev/blob/master/pandas-tutorial-master/img/splitApplyCombine.png?raw=1">

Similar to SQL `GROUP BY`
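The three steps can be spelled out by hand and compared with a single `groupby` call. The `songs` DataFrame below is hypothetical toy data, not the Spotify dataset loaded in the next cell.

```python
import pandas as pd

songs = pd.DataFrame({
    'album': ['A', 'A', 'B', 'B', 'B'],
    'plays': [10, 30, 5, 15, 25],
})

# Split: one sub-DataFrame per distinct album
groups = {album: part for album, part in songs.groupby('album')}

# Apply: compute the mean of 'plays' within each group
means = {album: part['plays'].mean() for album, part in groups.items()}

# Combine: collect the per-group results into one Series
manual = pd.Series(means, name='plays')

# groupby performs all three steps in a single expression
automatic = songs.groupby('album')['plays'].mean()

print(manual)
print(automatic)
```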

In [None]:
taylor_df = pd.read_csv('https://storage.googleapis.com/penn-cis5450/taylor_swift_spotify.csv')

In [None]:
taylor_df.groupby('album')

In [None]:
taylor_df.groupby('album').size()

In [None]:
taylor_df.groupby('album').size().sort_values(ascending=False)

Grouping on multiple columns

In [None]:
taylor_df.groupby(['album','popularity']).size().reset_index()

In [None]:
taylor_df.groupby(['album','popularity'])['valence'].mean().reset_index()
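When several statistics are needed per group, `agg` with named aggregations computes them in one pass. This is a sketch on a small hypothetical `tracks` DataFrame; the output column names `n_tracks`, `mean_valence`, and `max_popularity` are made up for the example.

```python
import pandas as pd

tracks = pd.DataFrame({
    'album': ['A', 'A', 'B'],
    'valence': [0.2, 0.6, 0.9],
    'popularity': [50, 70, 80],
})

# Each keyword is an output column: (input column, aggregation function)
summary = (
    tracks.groupby('album')
          .agg(n_tracks=('valence', 'count'),
               mean_valence=('valence', 'mean'),
               max_popularity=('popularity', 'max'))
          .reset_index()
)

print(summary)
```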

## Merge Operations

Merging in Pandas works much like a join in SQL. The four most commonly used merge methods are:
1. Left
2. Right
3. Inner
4. Outer

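Basic syntax: `pd.merge(left_dataframe, right_dataframe, left_on="some_column", right_on="some_column", how="left|right|inner|outer")`. When the key column has the same name in both dataframes, `on="some_column"` can replace `left_on` and `right_on`.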
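`left_on` and `right_on` matter when the key column is named differently on each side. The sketch below uses two small hypothetical dataframes, where the key is called `country` on the left and `nation` on the right.

```python
import pandas as pd

left = pd.DataFrame({'country': ['Belgium', 'France'],
                     'population': [11.3, 64.3]})
right = pd.DataFrame({'nation': ['France', 'Germany'],
                      'capital': ['Paris', 'Berlin']})

# Match left.country against right.nation, keeping every row of the left side
merged = pd.merge(left, right, left_on='country', right_on='nation', how='left')

print(merged)
```

Both key columns are kept in the result, and Belgium gets NaN in the columns that come from the right side.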

In [None]:
population = pd.DataFrame({'country': ['Germany', 'Belgium', 'France',
                        'United Kingdom', 'United States'],'population': [81.3, 11.3, 64.3, 64.9, 65.9]})


In [None]:
population

In [None]:
countries

In a Left Merge, every row from the LEFT side is kept, and data from the RIGHT side is added wherever it has
the same country. Left-side countries with no match on the right get NaN in the right-side columns.

In [None]:
pd.merge(left=population, right=countries, on="country", how="left")

In a Right Merge, every row from the RIGHT side is kept, and data from the LEFT side is added wherever it has
the same country. Right-side countries with no match on the left get NaN in the left-side columns.

In [None]:
pd.merge(left=population, right=countries, on="country", how="right")

With an Inner Merge, we chop up both dataframes and only glue the stuff that matches. If a country isn't in both
dataframes, we don't keep it and we don't add NaN's. If `how` is not specified, an inner merge is the
default.

In [None]:
pd.merge(left=population, right=countries,on ='country')

In [None]:
pd.merge(left=population, right=countries,on ='country', how = "inner")

With an Outer Merge, we chop up both dataframes and keep everything from both sides. Then we toss in NaN's to fill
any blanks.

In [None]:
pd.merge(left=population, right=countries,on ='country', how = "outer")
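To check which side each row of a merge came from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. The two dataframes here are small hypothetical stand-ins for `population` and `countries`.

```python
import pandas as pd

left = pd.DataFrame({'country': ['Belgium', 'France'],
                     'population': [11.3, 64.3]})
right = pd.DataFrame({'country': ['France', 'Germany'],
                      'capital': ['Paris', 'Berlin']})

# _merge is 'left_only', 'right_only', or 'both' for each row
both_sides = pd.merge(left, right, on='country', how='outer', indicator=True)

print(both_sides)
```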

#### Missing Data
How do we handle missing data (NaN's)? The most commonly used methods are `fillna` and `dropna`.

In [None]:
missing_df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],columns=['one', 'two', 'three'])
missing_df

In [None]:
missing_df['four'] = 'bar'
missing_df['five'] = missing_df['one'] > 0
missing_df.loc[['a','c','h'],['one','four']] = np.nan
missing_df

In [None]:
# fillna replaces NA/NaN values with the given value.
missing_df.fillna(0)

In [None]:
missing_df['one'].fillna('missing')
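Before filling, it often helps to count the missing values per column, and `fillna` also accepts a dict so that each column gets its own replacement. This sketch uses a small hypothetical `scores` DataFrame whose column names mirror `missing_df`.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({
    'one': [np.nan, 0.5, np.nan],
    'four': [np.nan, 'bar', 'bar'],
})

# Count missing values per column
missing_counts = scores.isna().sum()

# A dict maps each column name to its own fill value
filled = scores.fillna({'one': 0.0, 'four': 'unknown'})

print(missing_counts)
print(filled)
```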

`dropna` drops the rows or columns that contain NA/NaN values.

The `axis` argument determines whether rows or columns containing missing values are removed:
* `axis=0`: drop rows which contain missing values. (default)
* `axis=1`: drop columns which contain missing values.

The `how` argument determines whether a row or column is removed when it has at least one NA value or only when all of its values are NA:
* `how='any'`: if any NA values are present, drop that row or column. (default)
* `how='all'`: if all values are NA, drop that row or column.

In [None]:
missing_df.dropna(axis=0)

In [None]:
missing_df.dropna(axis=1)

In [None]:
missing_df['six'] = np.nan
missing_df

In [None]:
missing_df.dropna(axis=1, how = 'all')

In [None]:
# drop a row only if it has a missing value in one of the listed columns
missing_df.dropna(subset = ['one', 'two', 'four'])
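Between `how='any'` and `how='all'` there is a middle ground: `dropna` also accepts `thresh`, the minimum number of non-NA values a row must have to be kept. The `readings` DataFrame below is hypothetical toy data.

```python
import numpy as np
import pandas as pd

readings = pd.DataFrame({
    'one': [1.0, np.nan, np.nan],
    'two': [1.0, 2.0, np.nan],
    'three': [1.0, 2.0, 3.0],
})

# Keep only rows with at least 2 non-missing values
kept = readings.dropna(thresh=2)

print(kept)
```

The last row has only one non-missing value, so it is the only row dropped.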

In [None]:
df.head()