# Pandas Cheat Sheet
This notebook collects brief snippets of helpful pandas commands that I've accumulated over time. Because I don't use pandas every day, I previously kept all of this in Evernote, but figured a Jupyter notebook would be a better place to give examples.

## Combining multiple CSVs in a directory into a dataframe
CSVs are typically easier to work with here than Excel files, since the read_csv method has a chunksize argument and read_excel does not. Applying filters to each chunk before it's concatenated into a combined dataframe keeps memory use a ton more manageable.

In [None]:
import os
import os.path
import pandas as pd
import numpy as np

# if this is a script to be run from a cron job:
# SCRIPT_PATH = os.path.dirname(os.path.abspath(__file__))
# otherwise use '.' because __file__ does not exist in a jupyter notebook

SCRIPT_PATH = '.'
DATA_FOLDER = os.path.join(SCRIPT_PATH, 'data')
CHUNKSIZE = 10000

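# Match on '.csv' (with the dot) so names that merely end in 'csv' are skipped
csvs_in_dir = [os.path.join(DATA_FOLDER, f) for f in os.listdir(DATA_FOLDER) if f.endswith('.csv')]

chunks = []

for file in csvs_in_dir:
    for chunk in pd.read_csv(file, chunksize=CHUNKSIZE):
        chunk = chunk.replace(np.nan, 'None')
        # Add any other filters here
        chunks.append(chunk)

# Concatenate once at the end; calling pd.concat inside the loop
# re-copies the growing dataframe on every iteration
combined_df = pd.concat(chunks) if chunks else pd.DataFrame()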
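
As a rough sketch of what "getting filters in on each chunk" can look like, the example below writes two small CSV files to a temporary directory, keeps only the rows of each chunk where a made-up `amount` column exceeds 10, and concatenates the surviving rows once at the end. The file names, column names, threshold, and tiny chunk size are all assumptions for illustration, not part of the original data.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data; the 'id' and 'amount' columns are made up.
tmp_dir = tempfile.mkdtemp()
pd.DataFrame({'id': [1, 2, 3], 'amount': [5, 50, 500]}).to_csv(
    os.path.join(tmp_dir, 'a.csv'), index=False)
pd.DataFrame({'id': [4, 5], 'amount': [7, 70]}).to_csv(
    os.path.join(tmp_dir, 'b.csv'), index=False)

csv_paths = sorted(
    os.path.join(tmp_dir, f) for f in os.listdir(tmp_dir) if f.endswith('.csv')
)

filtered_chunks = []
for path in csv_paths:
    # chunksize=2 is tiny on purpose so the first file yields more than one chunk
    for chunk in pd.read_csv(path, chunksize=2):
        # Filter before keeping the chunk, so discarded rows never accumulate
        filtered_chunks.append(chunk[chunk['amount'] > 10])

# One concat at the end; ignore_index=True gives a clean 0..n-1 index
filtered_df = pd.concat(filtered_chunks, ignore_index=True)
print(filtered_df)
```

Only the rows that pass the filter are ever held in memory together, which is the main reason to filter per chunk rather than after combining.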

## Converting a column to date/time
For whatever reason, our date/time columns have never been formatted in a way that lets pandas auto-detect a datetime dtype, so I pass an explicit format string. For the format codes, I typically refer to https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior

In [None]:
DTTM_STR_FORMAT = '%m/%d/%Y %H:%M'

combined_df['ACTIVITY_DT_TM'] = pd.to_datetime(combined_df['ACTIVITY_DT_TM'], format=DTTM_STR_FORMAT)
combined_df['SERVICE_DT_TM'] = pd.to_datetime(combined_df['SERVICE_DT_TM'], format=DTTM_STR_FORMAT)
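
If some rows might not match the format, one option is the `errors='coerce'` argument of `pd.to_datetime`, which turns unparseable values into `NaT` instead of raising. The sketch below uses made-up sample values in the same `'%m/%d/%Y %H:%M'` layout; the `sample` dataframe is hypothetical.

```python
import pandas as pd

# Hypothetical values; the last one deliberately does not match the format
sample = pd.DataFrame(
    {'ACTIVITY_DT_TM': ['01/15/2018 08:30', '12/31/2017 23:59', 'not a date']}
)

# errors='coerce' converts values that fail to parse into NaT
sample['ACTIVITY_DT_TM'] = pd.to_datetime(
    sample['ACTIVITY_DT_TM'], format='%m/%d/%Y %H:%M', errors='coerce'
)

print(sample.dtypes)
# Count the rows that failed to parse so they can be inspected
print(sample['ACTIVITY_DT_TM'].isna().sum())
```

Counting the `NaT` values afterwards is a quick way to see how many rows need a closer look.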

## Replacing column values with regex
This is helpful if you need to make sweeping changes or formatting changes to the values of a column.

See also: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html

In [None]:
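# Assign the result back to the column instead of using inplace=True on a
# selected column; this avoids chained-assignment warnings without
# silencing them globally via pd.options.mode.chained_assignment
df['ORDERED_AS_MNEMONIC.1'] = df['ORDERED_AS_MNEMONIC.1'].replace(regex=True, to_replace=r' \(ONC\)', value=r'')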
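
To see what the regex replacement does, here is a small self-contained sketch on made-up values. The `orders` dataframe and its drug names are placeholders, and `Series.str.replace` is shown only as an alternative that should give the same result for string columns.

```python
import pandas as pd

# Hypothetical sample values; only some rows carry the ' (ONC)' suffix
orders = pd.DataFrame(
    {'ORDERED_AS_MNEMONIC': ['DrugA (ONC)', 'DrugB', 'DrugC (ONC)']}
)

# Series.replace with regex=True substitutes the matched part of each string
orders['via_replace'] = orders['ORDERED_AS_MNEMONIC'].replace(
    to_replace=r' \(ONC\)', value='', regex=True
)

# Series.str.replace is an alternative for string columns
orders['via_str_replace'] = orders['ORDERED_AS_MNEMONIC'].str.replace(
    r' \(ONC\)', '', regex=True
)

print(orders)
```

The parentheses are escaped because unescaped `(` and `)` would be treated as a regex group rather than literal characters.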

## Using the group by function to visualize continuous values

In [None]:
df.groupby(pd.cut(df['DIFF'], np.arange(-1830, 200, 20)))['DIFF'].count()
# ['DIFF'] selects which column to return
# count() is the aggregation function
# pd.cut sorts the continuous values into bins whose edges come from np.arange()
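
A small worked example of the same pattern, using made-up numbers and a much narrower range so the bins are easy to check by eye. The `diffs` dataframe and the bin edges are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical continuous values
diffs = pd.DataFrame({'DIFF': [-35, -5, 3, 12, 18, 27, 39]})

# Bin edges at -40, -20, 0, 20, 40; pd.cut intervals are right-closed
# by default, e.g. (-20, 0]
bins = np.arange(-40, 60, 20)

# observed=False keeps every bin in the result, including empty ones
counts = diffs.groupby(pd.cut(diffs['DIFF'], bins), observed=False)['DIFF'].count()
print(counts)
```

Each row of the result is one interval with the number of values that fell into it, which works as a quick text histogram.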

## Quick aggregation/summarizing columns
Summarizing can be done by creating a ```Series``` or a ```DataFrame```

With a ```Series```, the column in the ```groupby``` method becomes the index, and the column in brackets is subject to the aggregation function

With a ```DataFrame```, we can pass multiple aggregation functions to the ```agg``` method

In [None]:
# Creating a series
# 'column1' becomes the index; 'column2' is what gets counted
df.groupby('column1')['column2'].count()

# Creating a dataframe
df[['column1', 'column2']].groupby('column1').agg(['min', 'max'])
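
To make the Series versus DataFrame difference concrete, here is a minimal sketch on a made-up `wines` dataframe. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical sample data
wines = pd.DataFrame({
    'variety': ['Merlot', 'Merlot', 'Riesling', 'Riesling', 'Riesling'],
    'price': [20, 35, 12, 18, 15],
})

# One aggregation on one column gives a Series indexed by the group key
price_counts = wines.groupby('variety')['price'].count()
print(type(price_counts).__name__)
print(price_counts)

# A list of aggregations gives a DataFrame with one column per function
price_range = wines.groupby('variety')['price'].agg(['min', 'max'])
print(type(price_range).__name__)
print(price_range)
```

Selecting a single column with `['price']` before `agg` gives flat column names (`min`, `max`). Selecting with double brackets, as in the cell above, gives two-level column names such as `('price', 'min')`.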

## Sorting values on something that has a MultiIndex
If your dataframe has a MultiIndex on its columns, a quick glance at it should show more than one level of column names. To sort by such columns, pass each one to the sort_values method as a tuple of its level names, for example ('price', 'min').

In [None]:
reviews[['variety', 'price']].groupby('variety').agg(['min', 'max']).sort_values([('price', 'min'), ('price', 'max')], ascending=False)
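
The same call can be tried on a small made-up dataframe to see the column tuples it produces. The `reviews_sample` data below is hypothetical and stands in for the real `reviews` dataframe.

```python
import pandas as pd

# Hypothetical stand-in for the reviews dataframe
reviews_sample = pd.DataFrame({
    'variety': ['Merlot', 'Merlot', 'Riesling', 'Riesling', 'Syrah'],
    'price': [20, 35, 12, 18, 20],
})

summary = reviews_sample[['variety', 'price']].groupby('variety').agg(['min', 'max'])

# Each column label is a tuple of its level names
print(summary.columns.tolist())

# Sort by ('price', 'min') first, then break ties with ('price', 'max')
sorted_summary = summary.sort_values(
    [('price', 'min'), ('price', 'max')], ascending=False
)
print(sorted_summary)
```

Printing `summary.columns.tolist()` first is a handy way to find the exact tuples to pass to `sort_values`.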