# Pandas Reference Notebook

## Tidy Data

Paper by Hadley Wickham <br>
https://vita.had.co.nz/papers/tidy-data.pdf

## Time Series Resampling

Resampling changes the sampling frequency of a time-indexed Series or DataFrame. Call `.resample()` with a frequency string, then chain an aggregate such as `mean()`, `sum()`, or `count()`.

In [None]:
# 'D' indicates daily
daily_mean = sales.resample('D').mean()

# '2W' indicates every two weeks
weekly = sales.resample('2W').mean()

# 'B' indicates per business day, ffill() forward fills
bdaily_mean = sales.resample('B').mean().ffill()
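
A minimal, self-contained sketch of downsampling, assuming a made-up `sales` table with two readings per day. The column name `Units` and the values are illustrative only.

```python
import pandas as pd

# Hypothetical data: two readings per day over two days
idx = pd.to_datetime(['2024-01-01 00:00', '2024-01-01 12:00',
                      '2024-01-02 00:00', '2024-01-02 12:00'])
sales = pd.DataFrame({'Units': [10, 20, 30, 50]}, index=idx)

# Downsample to one row per day, using two different aggregates
daily_mean = sales.resample('D').mean()
daily_sum = sales.resample('D').sum()

print(daily_mean)
print(daily_sum)
```

The aggregate decides how the readings that fall into each new period are combined, so the same `.resample('D')` call can produce quite different tables.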

## Rolling/Moving Average

A rolling (moving) average smooths out short-term fluctuations. `.rolling()` only defines the window, so you must always chain an aggregation method such as `.mean()` after it.

In [None]:
hourly_data.rolling(window=24).mean()
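
A small sketch of what the window does, using a made-up Series. By default the first `window - 1` results are `NaN` because the window is not yet full; `min_periods` relaxes that.

```python
import pandas as pd

# Hypothetical values, chosen so the averages are easy to check by hand
s = pd.Series([1, 2, 3, 4, 5])

# The first two results are NaN: a 3-wide window is not full until row 2
rolled = s.rolling(window=3).mean()

# min_periods=1 averages whatever is available at the start
rolled_partial = s.rolling(window=3, min_periods=1).mean()

print(rolled)
print(rolled_partial)
```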

## Imputing Missing Values

Missing values cause problems for many machine learning algorithms, so fill them in with pandas first. Note that you don't need to pass an argument to `impute_median` below: `.transform()` calls it once per group and passes in that group's Series.

In [None]:
# Write a function that imputes median
def impute_median(series):
    return series.fillna(series.median())

# Fill each missing age with the median age of its (sex, pclass) group
titanic['age'] = titanic.groupby(['sex', 'pclass'])['age'].transform(impute_median)
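
A self-contained sketch of group-wise median imputation on a tiny, made-up stand-in for the `titanic` table, so the result can be checked by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only the column names follow the notebook
titanic = pd.DataFrame({
    'sex': ['male', 'male', 'male', 'female', 'female', 'female'],
    'pclass': [1, 1, 1, 3, 3, 3],
    'age': [40.0, np.nan, 50.0, 20.0, 30.0, np.nan],
})

def impute_median(series):
    # Replace missing values with the median of the non-missing ones
    return series.fillna(series.median())

# transform() returns a Series aligned with the original rows
titanic['age'] = titanic.groupby(['sex', 'pclass'])['age'].transform(impute_median)

print(titanic)
```

Each gap is filled from its own group: the missing male, first-class age becomes 45.0 and the missing female, third-class age becomes 25.0.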

## Groupby and Filtering

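A groupby object can be inspected, iterated over, and combined with boolean filters. The examples below compute the mean `mpg` per model year, first for all cars and then for Chevrolets only.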

In [None]:
splitting = auto.groupby('yr')

type(splitting)
# pandas.core.groupby.DataFrameGroupBy

type(splitting.groups)
# dict

print(splitting.groups.keys())
# dict_keys([70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82])

# groupby object: iteration
for group_name, group in splitting:
    avg = group['mpg'].mean()
    print(group_name, avg)

# groupby object: iteration and filtering
for group_name, group in splitting:
    avg = group.loc[group['name'].str.contains('chevrolet'), 'mpg'].mean()
    print(group_name, avg)
    
# groupby object: comprehension
chevy_means = {year:group
               .loc[group['name'].str.contains('chevrolet'),'mpg'].mean()
               for year,group in splitting}
pd.Series(chevy_means)

# boolean groupby
chevy = auto['name'].str.contains('chevrolet')
auto.groupby(['yr', chevy])['mpg'].mean()

## Grouping and filtering with .filter()

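You can use `.groupby()` with the `.filter()` method to remove whole groups of rows from a DataFrame based on a boolean condition. The function passed to `.filter()` receives each group as a DataFrame and must return a single `True` or `False`; groups that return `False` are dropped.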

In [None]:
sales = pd.read_csv('sales.csv', index_col='Date', parse_dates=True)
by_company = sales.groupby('Company')
by_com_sum = by_company.Units.sum()
by_com_filt = by_company.filter(lambda g: g.Units.sum() > 35)
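
A runnable sketch of the same filter that does not need `sales.csv`. The company names and unit counts are made up.

```python
import pandas as pd

# Hypothetical sales records
sales = pd.DataFrame({
    'Company': ['Acme', 'Acme', 'Initech', 'Hooli', 'Hooli'],
    'Units': [18, 20, 10, 30, 3],
})

by_company = sales.groupby('Company')

# Per-company totals: Acme 38, Hooli 33, Initech 10
by_com_sum = by_company['Units'].sum()

# Keep only the rows of companies whose total exceeds 35
by_com_filt = by_company.filter(lambda g: g['Units'].sum() > 35)

print(by_com_sum)
print(by_com_filt)
```

Note that `.filter()` returns the original rows of the surviving groups, not one aggregated row per group.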

## Grouping and filtering with .map()

You may instead want to group by a function/transformation of a column. The key here is that the Series is indexed the same way as the DataFrame. You can also mix and match column grouping with Series grouping. Here, calling `.mean()` on the survived column returns the fraction that survived, because survival is indicated by a 1.

In [None]:
under10 = titanic.age < 10
under10 = under10.map({True:'under 10',False:'over 10'})

# grouped by only under 10
survived_mean_1 = titanic.groupby(under10).survived.mean()

# grouped by under 10 and pclass
survived_mean_2 = titanic.groupby([under10,'pclass']).survived.mean()

## .idxmax() and .idxmin() methods

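These two methods return the index label of the max or min value, respectively. On a DataFrame they work down each column by default and return one row label per column; pass `axis='columns'` to get the column label of the extreme value in each row instead. `df.T.idxmax(axis='columns')` therefore gives the same labels as `df.idxmax()`.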
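
A short sketch of both directions, using a made-up table of temperatures (the city and day names are illustrative).

```python
import pandas as pd

# Hypothetical temperatures: rows are cities, columns are days
temps = pd.DataFrame(
    {'Mon': [10, 25], 'Tue': [15, 20]},
    index=['Oslo', 'Cairo'],
)

# Default: for each column, the row label holding the maximum
hottest_city = temps.idxmax()

# axis='columns': for each row, the column label holding the maximum
hottest_day = temps.idxmax(axis='columns')

# idxmin works the same way for minima
coldest_city = temps.idxmin()

print(hottest_city)
print(hottest_day)
print(coldest_city)
```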

## .nunique() method

Given a categorical Series `S`, `S.nunique()` returns the number of distinct categories.
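
A quick sketch with made-up values. Missing values are not counted unless `dropna=False` is passed.

```python
import pandas as pd

# Hypothetical categorical data with one missing entry
S = pd.Series(['red', 'blue', 'red', None, 'green'])

n_categories = S.nunique()                     # missing value ignored
n_with_missing = S.nunique(dropna=False)       # missing value counted once

print(n_categories, n_with_missing)
```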

## SQL

In [None]:
from sqlalchemy import create_engine
import pandas as pd

engine = create_engine('sqlite:///Chinook.sqlite')

# Example 1
df = pd.read_sql_query("SELECT * FROM Orders", engine)

# Example 2
df = pd.read_sql_query(
    'SELECT * FROM Employee WHERE EmployeeID >= 6 ORDER BY BirthDate'
    ,engine)

# Example 3 (inner join)
df = pd.read_sql_query("""
    SELECT Title, Name FROM Album INNER JOIN Artist on 
    Album.ArtistID = Artist.ArtistID"""
    ,engine)

# Example 4 (inner join with filtering)
df = pd.read_sql_query("""
    SELECT * FROM PlaylistTrack INNER JOIN Track ON 
    PlaylistTrack.TrackID = Track.TrackID 
    WHERE Milliseconds < 250000
    """
    ,engine)
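
To try the join pattern without the Chinook file, here is a sketch that builds a throwaway in-memory SQLite database with the standard-library `sqlite3` module, which `pd.read_sql_query` also accepts as a connection. The two tables and their rows are made up and only mimic the `Album`/`Artist` layout used above.

```python
import sqlite3

import pandas as pd

# In-memory database with two tiny, hypothetical tables
conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE Artist (ArtistID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Album (AlbumID INTEGER PRIMARY KEY, Title TEXT, ArtistID INTEGER);
    INSERT INTO Artist VALUES (1, 'Artist A'), (2, 'Artist B');
    INSERT INTO Album VALUES (1, 'First Album', 1),
                             (2, 'Second Album', 1),
                             (3, 'Third Album', 2);
""")

# Inner join: one row per album, with the matching artist name
df = pd.read_sql_query("""
    SELECT Title, Name
    FROM Album INNER JOIN Artist ON Album.ArtistID = Artist.ArtistID
    ORDER BY Title
""", conn)
conn.close()

print(df)
```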

## PyODBC

In [None]:
import pyodbc

# set up dsn, uid, pwd through local config utility;
# the three strings below are placeholders for your own settings
dsn, uid, pwd = 'dsn', 'uid', 'pwd'

sql = ''' SQL HERE '''

with pyodbc.connect(f'DSN={dsn};UID={uid};PWD={pwd}') as connection:
    df = pd.read_sql(sql, connection)

## List Comprehension within a DataFrame

Here we use a list comprehension to populate a new column in a DataFrame. The two values of each tuple are multiplied and the result is added to the new column `'Total Urban Population'`.


In [None]:
# Example only, cell will not run
pops = zip(df_pop_ceb['Total Population'], 
           df_pop_ceb['Urban population (% of total)'])
pops_list = list(pops)

df_pop_ceb['Total Urban Population'] = [int(p1*p2*0.01) for p1,p2 in pops_list]
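
The same column can also be computed with vectorized arithmetic instead of a Python-level loop. This is a sketch on a made-up two-row table (only the column names follow the cell above); it checks that both approaches agree.

```python
import pandas as pd

# Hypothetical population figures
df_pop = pd.DataFrame({
    'Total Population': [1000, 2500],
    'Urban population (% of total)': [50.0, 40.0],
})

# List-comprehension version, as in the cell above
pairs = zip(df_pop['Total Population'], df_pop['Urban population (% of total)'])
from_loop = [int(p1 * p2 * 0.01) for p1, p2 in pairs]

# Vectorized version: multiply whole columns, then truncate to integers
df_pop['Total Urban Population'] = (
    df_pop['Total Population'] * df_pop['Urban population (% of total)'] * 0.01
).astype(int)

print(df_pop)
```

The vectorized form is usually faster on large tables and keeps the calculation in one expression.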

## Replacing characters in column names

Here we use `.str.replace()` to replace a single character within a set of column names.

In [None]:
temps_c.columns = temps_f.columns.str.replace('F','C')

## Splitting a Column, .split() and .get()

Another common way multiple variables are stored in columns is with a delimiter. First split the string using `.split()`, then retrieve the first element with `.get(0)` and the second element with `.get(1)`. <br>

Remember to access the `.str` attribute before applying `.split()` or `.get()`.

In [None]:
# splitting a column
ebola_melt['str_split'] = ebola_melt.type_country.str.split('_')

# retrieving first word
ebola_melt['type'] = ebola_melt.str_split.str.get(0)

# retrieving second word
ebola_melt['country'] = ebola_melt.str_split.str.get(1)
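
If the intermediate list column is not needed, `.str.split()` with `expand=True` returns the pieces as separate columns in one step. A sketch with made-up values in the `type_country` format:

```python
import pandas as pd

# Hypothetical rows in 'type_country' form
ebola_melt = pd.DataFrame({'type_country': ['Cases_Guinea', 'Deaths_Liberia']})

# expand=True yields a DataFrame with one column per piece
ebola_melt[['type', 'country']] = ebola_melt['type_country'].str.split('_', expand=True)

print(ebola_melt)
```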

## Globbing (glob)

The `glob` module can be used to match patterns and return a list. This is useful if you need to load multiple data files in pandas.

In [None]:
import pandas as pd
import glob

pattern = '*.csv'  # '*' matches any run of characters, '?' matches exactly one
csv_files = glob.glob(pattern)
frames = []

for csv in csv_files:
    df = pd.read_csv(csv)
    frames.append(df)
    
uber = pd.concat(frames, ignore_index=True)
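
A self-contained sketch of the same load-and-concatenate pattern. It writes two made-up CSV files into a temporary directory first, so it runs anywhere; the file names and the `rides` column are illustrative.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create two small, hypothetical CSV files to find
    pd.DataFrame({'rides': [1, 2]}).to_csv(os.path.join(tmp_dir, 'part1.csv'), index=False)
    pd.DataFrame({'rides': [3]}).to_csv(os.path.join(tmp_dir, 'part2.csv'), index=False)

    # glob returns matches in arbitrary order, so sort for a stable result
    csv_files = sorted(glob.glob(os.path.join(tmp_dir, '*.csv')))

    # Read every match and stack the results with a fresh index
    combined = pd.concat([pd.read_csv(path) for path in csv_files], ignore_index=True)

print(combined)
```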

## Regular Expressions (RegEx)

Cheat sheet: <br>
https://www.rexegg.com/regex-quickstart.html

In [None]:
import re

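# example 1: a dollar amount with two decimal places
# (raw strings keep backslashes such as \d from being treated as string escapes)
pattern = re.compile(r'\$\d*\.\d{2}')
result = pattern.match('$17.89')
bool(result)

# example 2: a phone number in the form xxx-xxx-xxxx
prog = re.compile(r'\d{3}-\d{3}-\d{4}')
result = prog.match('123-456-7890')
bool(result)

# example 3 (findall()): every run of digits in the string
matches = re.findall(r'\d+', 'the recipe calls for 10 strawberries and 1 banana')
print(matches)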
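
One detail worth keeping in mind when validating values: `.match()` only anchors the pattern at the start of the string, so trailing characters are accepted. A short sketch of the difference, using `.fullmatch()` from the standard `re` module:

```python
import re

pattern = re.compile(r'\$\d*\.\d{2}')

exact = bool(pattern.match('$17.89'))            # whole string fits the pattern
trailing = bool(pattern.match('$17.899'))        # extra digit is ignored by match()
strict = bool(pattern.fullmatch('$17.899'))      # fullmatch() requires the entire string
not_at_start = bool(pattern.match('cost $17.89'))  # match() never searches mid-string

print(exact, trailing, strict, not_at_start)
```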

## Lambda Function and RegEx

The following lambda function extracts the numeric part of each value in the column, which drops the $ dollar sign. The result is still a string.

In [None]:
tips['total_dollar_re'] = tips.total_dollar.apply(
    lambda x: re.findall(r'\d+\.\d+', x)[0])
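
As an alternative sketch, the `.str.extract()` accessor applies a regular expression to the whole column without `apply`, and the result can be converted to a number in the same expression. The `tips` rows are made up and the column name `total_dollar_num` is illustrative.

```python
import pandas as pd

# Hypothetical dollar amounts stored as strings
tips = pd.DataFrame({'total_dollar': ['$16.99', '$10.34']})

# The capture group selects the numeric part; expand=False returns a Series
tips['total_dollar_num'] = (
    tips['total_dollar'].str.extract(r'(\d+\.\d+)', expand=False).astype(float)
)

print(tips)
```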

## Testing with asserts

You can use `assert` to check for missing data programmatically instead of by visual inspection. Chaining two `.all()` methods first reduces each column to a single boolean, then reduces those column results to one overall boolean. <br>
<br>
Note that `ebola` is a DataFrame.

In [None]:
# check for missing values
assert ebola.notnull().all().all()

# check that values are >= 0
assert (ebola >= 0).all().all()

# check for null (this assert passes only if df has at least one missing value)
assert df.isnull().any(axis=None)
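
To see what each step of the chain produces, here is a sketch on a made-up `ebola` table with one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical counts with one missing entry
ebola = pd.DataFrame({'Cases': [10, 20], 'Deaths': [1.0, np.nan]})

# First .all(): one boolean per column
per_column = ebola.notnull().all()

# Second .all(): one boolean for the whole DataFrame
overall = per_column.all()

# axis=None reduces over both axes in a single call
overall_one_call = ebola.notnull().all(axis=None)

print(per_column)
print(overall, overall_one_call)
```

With this data, `assert ebola.notnull().all().all()` would raise an `AssertionError`, which is the point of the check.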

## Loading Multiple DataFrames

Using a for loop with a list of filenames:

In [None]:
filenames = ['Gold.csv', 'Silver.csv', 'Bronze.csv']

dataframes = []
for filename in filenames:
    dataframes.append(pd.read_csv(filename))

Using a for loop with string interpolation (this fragment assumes `medal_types` is a list of medal names defined elsewhere):

In [None]:
for medal in medal_types:
    file_name = "%s_top5.csv" % medal
    columns = ['Country', medal]    

Using a dictionary of DataFrames:

In [None]:
# Load DataFrame from file_path: editions
editions = pd.read_csv(file_path,sep='\t')

# Extract the relevant columns: editions
editions = editions[['Edition','Grand Total','City','Country']]

# Create empty dictionary: medals_dict
medals_dict = {}

for year in editions['Edition']:

    # Create the file path: file_path
    file_path = 'summer_{:d}.csv'.format(year)
    
    # Load file_path into a DataFrame: medals_dict[year]
    medals_dict[year] = pd.read_csv(file_path)
    
    # Extract relevant columns: medals_dict[year]
    medals_dict[year] = medals_dict[year][['Athlete','NOC','Medal']]
    
    # Assign year to column 'Edition' of medals_dict
    medals_dict[year]['Edition'] = year
    
# Concatenate medals_dict: medals
medals = pd.concat(medals_dict, ignore_index=True)

# Print first and last 5 rows of medals
print(medals.head())
print(medals.tail())

## Combining DataFrames

Here are simplified guidelines for deciding which pandas method to use when combining DataFrames (ranked roughly in order of increasing complexity).

`df1.append(df2)`
- stacking vertically (deprecated in pandas 1.4 and removed in 2.0; use `pd.concat([df1, df2])` instead) <br>

`pd.concat([df1,df2])`
- stacking many horizontally or vertically
- simple inner/outer joins on indexes  <br>

`df1.join(df2)`
- inner/outer/left/right joins on indexes <br>

`pd.merge(df1, df2)`
- many kinds of joins on one or more columns <br>

`pd.merge_ordered(df1, df2)`
- same as merge but for ordered data like time series data <br>

`pd.merge_asof(df1, df2)`
- similar to `pd.merge_ordered()`, but it joins on the nearest key rather than on equal keys: both DataFrames must be sorted by the 'on' column, and each row in the left DataFrame is matched with the last row in the right DataFrame whose 'on' value is less than or equal to the left value (the default `direction='backward'`)
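
A compact sketch of three of these functions on made-up tables, so the differences in the output are easy to see. All table and column names are illustrative.

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'y': [20, 30, 40]})

# concat: stack rows; columns missing from one table are filled with NaN
stacked = pd.concat([left, right], ignore_index=True)

# merge: inner join keeps only keys present in both tables
inner = pd.merge(left, right, on='key')

# merge with how='outer' keeps keys from either table
outer = pd.merge(left, right, on='key', how='outer')

# merge_asof: match each left row to the latest right row at or before it
trades = pd.DataFrame({'time': [2, 5, 9], 'qty': [100, 200, 300]})
quotes = pd.DataFrame({'time': [1, 4, 10], 'price': [10.0, 11.0, 12.0]})
asof = pd.merge_asof(trades, quotes, on='time')

print(stacked, inner, outer, asof, sep='\n\n')
```

In the `merge_asof` result, the trade at time 9 picks up the quote from time 4, because the quote at time 10 lies after it.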

## Arithmetic Operations on DataFrames and Series

Series with common indexes can be manipulated with simple arithmetic operations such as `+` and `-` as is. Specific functions and methods may be useful when indexes do not align exactly.
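
A small sketch of what happens when indexes do not align, using two made-up Series. The `+` operator produces `NaN` for labels missing on either side, while the method form accepts `fill_value`.

```python
import pandas as pd

# Hypothetical Series; 'z' exists only in the first one
a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20], index=['x', 'y'])

# Operator form: the unmatched label becomes NaN
plain = a + b

# Method form: treat the missing side as 0 before adding
filled = a.add(b, fill_value=0)

print(plain)
print(filled)
```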

#### Divide with `.divide()`

In [None]:
week1_range.divide(week1_mean, axis='rows')

#### Add with `.add()`

In [None]:
# use bracket notation on the left: attribute assignment cannot create a new column
temperature['Kelvin'] = temperature.Celsius.add(273.15)

#### Multiply with `.multiply()`

In [None]:
pounds = dollars.multiply(exchange['GBP/USD'],axis='rows')

#### Percent change with `.pct_change()`

In [None]:
# percent change
week1_mean.pct_change() * 100 # multiply by 100 to show as percentage