# Pandas

##### Magic Commands

_Print all output_ <br>
from IPython.core.interactiveshell import InteractiveShell <br>
InteractiveShell.ast_node_interactivity = "all"


_Time code execution_ <br>
%timeit -n 5 movie['year'].sort_values().drop_duplicates().head()

_Render plots inline_ <br>
%matplotlib inline

##### Reading in data and parsing dates
emp = pd.read_csv('../data/employee.csv', parse_dates=['hire_date'])
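
A minimal sketch of what `parse_dates` changes. The inline CSV text and its column values are made up here to stand in for `../data/employee.csv`, whose real contents are not shown in these notes.

```python
import io
import pandas as pd

# Hypothetical stand-in for ../data/employee.csv
csv_text = """name,hire_date,salary
Ana,2015-03-01,52000
Ben,2018-11-15,61000
"""

emp = pd.read_csv(io.StringIO(csv_text), parse_dates=['hire_date'])

# hire_date is parsed to a datetime dtype instead of being left as strings
print(emp.dtypes)
print(emp['hire_date'].dt.year.tolist())
```

Without `parse_dates`, the `hire_date` column would be read as plain strings and the `.dt` accessor would not be available.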

##### Setting DF Index

_Set index_ <br>
df2 = df.set_index('col_name')

_Set index & validate uniqueness_ <br>
movie2.set_index('title', verify_integrity=True)

_Set Index on read_ <br>
movie = pd.read_csv('../data/movie.csv', index_col = 'title')
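
A small sketch of the difference `verify_integrity` makes, assuming a made-up `movie2` table with one repeated title. Note that `set_index` returns a new DataFrame rather than modifying the original.

```python
import pandas as pd

# Hypothetical movie table with a duplicated title
movie2 = pd.DataFrame({
    'title': ['Alien', 'Heat', 'Alien'],
    'year': [1979, 1995, 2003],
})

# A plain set_index accepts duplicate labels silently
indexed = movie2.set_index('title')
print(indexed.index.is_unique)

# verify_integrity=True raises ValueError when the new index has duplicates
try:
    movie2.set_index('title', verify_integrity=True)
    raised = False
except ValueError as err:
    raised = True
    print('duplicate titles:', err)
```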

##### Modifying display options

_Set max cols and rows_ <br>
pd.set_option('display.max_columns', 100, 'display.max_rows', 4)

##### EDA

_df.info_ <br>
column, non-null count, dtype

_df.count_ <br>
non-null count of values (can use axis = 1 to count per row)

_df.describe_ <br>
summary stats (count, mean, std, and the five-number summary: min, 25%, 50%, 75%, max) per numeric column
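
The three EDA calls above can be tried on a toy frame. The DataFrame below is invented for illustration and has one missing value per column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, 2.0, np.nan, 4.0],
    'b': ['x', None, 'z', 'w'],
})

df.info()                      # prints columns, non-null counts, dtypes
col_counts = df.count()        # non-null values per column
row_counts = df.count(axis=1)  # non-null values per row
stats = df.describe()          # numeric columns only by default
print(stats)
```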

##### Grouping and aggregation
avg_duration = movie.groupby('year').agg({'duration':'mean'})
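
A quick sketch of the same aggregation on invented data. The second call uses named aggregation, which is an alternative spelling and not something the original note relies on; the output column names `mean_duration` and `n_movies` are made up.

```python
import pandas as pd

movie = pd.DataFrame({
    'year': [2000, 2000, 2001, 2001],
    'duration': [90, 110, 100, 120],
})

# Dictionary form: column name -> aggregation
avg_duration = movie.groupby('year').agg({'duration': 'mean'})
print(avg_duration)

# Named aggregation gives explicit output column names
summary = movie.groupby('year').agg(
    mean_duration=('duration', 'mean'),
    n_movies=('duration', 'size'),
)
print(summary)
```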

## Subset selection

#### Just the brackets
Single column - df['col_name']

Several columns - df[['col1', 'col2', 'col3']] <<< note that the column names must be passed as a list, hence the double brackets


#### .loc (Separate row and column selection with a comma for loc)
The nice benefit of loc is that it allows us to simultaneously select rows with boolean selection and columns by label. <br>
df.loc[start:stop:step, 'col_name'] <<< label slices include the stop label

df.loc[:, ['col1', 'col2']]

df.loc['start_row':'stop_row'] <<< returns all columns; the trailing `, :` is optional

df.loc['Niko'] <<< selecting a single row label returns a Series, with the column names as its index

df.loc[['Niko']] <<< selecting a row label inside a list returns a DataFrame
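
The `.loc` patterns above, run against a small invented frame indexed by name. Only the row label `Niko` comes from the notes; the other labels and columns are made up.

```python
import pandas as pd

df = pd.DataFrame(
    {'age': [31, 25, 47], 'city': ['Oslo', 'Lima', 'Pune'], 'score': [8, 6, 9]},
    index=['Niko', 'Penelope', 'Aria'],
)

row_series = df.loc['Niko']          # Series, column names as its index
row_frame = df.loc[['Niko']]         # one-row DataFrame
sliced = df.loc['Niko':'Penelope']   # label slice includes the stop label
# Boolean row selection and label column selection in one call
older = df.loc[df['age'] > 30, ['city', 'score']]
print(older)
```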

#### Logical operators

Core Python provides the logical operators `and`, `or`, and `not` to combine multiple conditions together. These operators always return a boolean value. 

These built-in logical operators do not work for creating multiple conditions with a **boolean Series**. Instead, you must use the following operators.

* `&` for and (ampersand character)
* `|` for or (pipe character)
* `~` for not (tilde character)
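
A short sketch of why the built-in operators fail on a boolean Series and how the three replacements behave. The `bikes` values are invented. When conditions are written inline, each comparison needs its own parentheses because `&` and `|` bind more tightly than `>` or `==`.

```python
import pandas as pd

bikes = pd.DataFrame({
    'tripduration': [500, 1500, 2000, 800],
    'gender': ['Male', 'Male', 'Female', 'Male'],
})

filt1 = bikes['tripduration'] > 1000
filt2 = bikes['gender'] == 'Male'

# `and` asks for a single truth value of each Series, which is ambiguous
try:
    filt1 and filt2
    and_failed = False
except ValueError:
    and_failed = True

both = bikes[filt1 & filt2]    # long trips by male riders
either = bikes[filt1 | filt2]  # long trips or male riders
not_male = bikes[~filt2]       # everything except male riders
```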

###### The `query` method

The `query` method allows you to filter the data by writing the condition as a string. Unlike boolean selection, you can use the words `and`, `or`, and `not` instead of the operators `&`, `|`, and `~`, which further aids readability. By default, all names within the query string attempt to reference a column name. You can, however, reference a Python variable by preceding its name with the `@` symbol.

Unfortunately the `query` method does not give us the ability to select a subset of the columns when filtering the data. You would have to do normal column selection after calling the method. Here, we use *just the brackets* to select three columns after finding all the rides where the weather was snow or rain.
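
A self-contained sketch of the two points above: referencing a local variable with `@`, and selecting columns after the filter. The rows and the variable name `min_duration` are invented for illustration.

```python
import pandas as pd

bikes = pd.DataFrame({
    'starttime': ['2013-06-28 19:01', '2014-01-05 08:30', '2014-07-04 12:00'],
    'temperature': [73.9, 28.0, 88.0],
    'events': ['cloudy', 'snow', 'rain'],
    'tripduration': [993, 1200, 400],
})

min_duration = 900  # local Python variable, referenced with @ below
long_rides = bikes.query('tripduration > @min_duration and temperature < 80')

# query filters rows only, so columns are picked afterwards with brackets
cols = ['starttime', 'temperature', 'events']
wet = bikes.query('events in ["snow", "rain"]')[cols]
print(wet)
```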


5 > 3 or 10 > 20 (True) <br>
5 > 3 and 10 > 20 (False)

filt1 = bikes['tripduration'] > 1000
filt2 = bikes['gender'] == 'Male'
filt = filt1 & filt2

bikes[(bikes['tripduration'] > 1000) & (bikes['gender'] == 'Male')]

_query takes the condition as a string and can reference column names within it_ <br>
bikes.query('tripduration > 1000 and temperature > 85').head(3)

filt = bikes['events'].isin(['rain', 'snow', 'tstorms', 'sleet'])

filt = (movie['gross'].isna()) & (movie['budget'].isna())
cols = ['gross', 'budget']

movie.loc[filt, cols]


bikes.query('50 <= temperature <= 60').head(3)

_Column-to-column comparison_ <br>
bikes.query('start_capacity > end_capacity').head(3)

_Membership test against several values_ <br>
bikes.query('events in ["snow", "rain"]').head(3)

_Arithmetic operations within the string_ <br>
bikes.query('start_capacity - end_capacity >= 20').head(3)

_Selecting columns after query_ <br>
cols = ['starttime', 'temperature', 'events'] <br>
bikes.query('events in ["snow", "rain"]')[cols].head()


## Series 
### Sorting, ranking, uniqueness
By default, all missing values are placed at the end of the resulting Series when sorting. You can change this so that they appear first by setting the na_position parameter to 'first'. This is a good way to quickly view all the missing values in your Series. 

##### Uniqueness

There are a few methods that deal with unique values in a Series:

* `unique` - Returns a **numpy array** of the unique values in order of appearance
* `nunique` - Returns the number of unique values in the Series; missing values are excluded unless you pass `dropna=False`
* `drop_duplicates` - Returns a **pandas Series** of just the unique values
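
The three uniqueness methods side by side on an invented numeric Series with one missing value.

```python
import numpy as np
import pandas as pd

s = pd.Series([3, 1, 3, np.nan, 1])

uniques = s.unique()                  # array, order of first appearance
n_no_nan = s.nunique()                # missing values excluded by default
n_with_nan = s.nunique(dropna=False)  # missing value counted as a value
deduped = s.drop_duplicates()         # Series, original index labels kept
print(deduped)
```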

_Sorting_ <br>
movie['duration'].sort_values(na_position='first').head() <br>
movie['duration'].sort_index()

_Ranking_ <br>
score10.rank()
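
A sketch of sorting and ranking on a small invented Series with a tie and a missing value. The Series name `duration` and its labels are made up.

```python
import numpy as np
import pandas as pd

duration = pd.Series([120.0, np.nan, 95.0, 120.0], index=['d', 'b', 'a', 'c'])

by_value = duration.sort_values(na_position='first')  # NaN shown first
by_label = duration.sort_index()                      # sorts on index labels
ranks = duration.rank()  # ties share the average rank; NaN stays NaN
print(ranks)
```

With the default `method='average'`, the two tied values of 120 would occupy ranks 2 and 3, so each receives 2.5.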

##### Arithmetic

`diff` takes the difference between the current value and another value (by default the immediately preceding one; use the `periods` param to set the interval)

`pct_change` returns the fractional change from the previous value (0.05 means +5%)

`sample` randomly samples values from your Series (`n` param for the number of values, `frac` param to sample a fraction, like 15%, of the dataset; the default is **without** replacement, use `replace=True` for replacement)

`replace(cur val, new val)` replaces values in the Series w/ the desired str, int, etc. (by default only whole values are matched; use `regex=True` to replace substrings within strings)

Taking the `mean` of a `boolean` Series (i.e. Trues and Falses) returns the proportion of values that are True
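
The methods in this section applied to small invented Series. `random_state=0` is only there to make the sample repeatable.

```python
import pandas as pd

s = pd.Series([100, 110, 99, 132])

step = s.diff()               # NaN, 10, -11, 33
step2 = s.diff(periods=2)     # compares with the value two rows back
change = s.pct_change()       # fractional change, about 0.10 for +10%
picked = s.sample(n=2, random_state=0)  # without replacement by default
share_big = (s > 100).mean()  # proportion of True values

colors = pd.Series(['red car', 'blue car'])
swapped = colors.replace('car', 'bike', regex=True)  # substring match
print(swapped.tolist())
```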


### String and Datetime series
pandas provides a collection of methods only available to object, string, and datetime columns.

To access these special string-only methods, append the Series variable name with .str followed by . and then the specific string method (pandas refers to this as the str accessor)

title.str.lower().head()

* `title.str.count()` - number of occurrences of the passed pattern in each string
* `title.str.contains()` - boolean Series denoting whether each string contains the passed pattern
* `title.str.len()` - length of each string
* `title.str.split()` - splits each string on the passed separator (defaults to whitespace); use `expand=True` to return a DataFrame
* `title.str.replace()` - replaces the specified substring with the given replacement string


###### select a substring from every string w/ just the brackets
title.str[10].head(3) <<< a single position returns one character; a slice such as title.str[:10] returns a substring
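
A compact sketch of the `str` accessor methods listed above, using three invented titles.

```python
import pandas as pd

title = pd.Series(['Star Wars', 'The Matrix', 'Up'])

lower = title.str.lower()
n_a = title.str.count('a')            # occurrences of 'a' per string
has_the = title.str.contains('The')   # boolean Series
lengths = title.str.len()
parts = title.str.split(expand=True)  # DataFrame, one column per piece
first_char = title.str[0]             # single character at position 0
first_three = title.str[:3]           # slice gives a substring
print(parts)
```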

### Datetime series

Many of the datetime components available through the `.dt` accessor are attributes (vs methods)

start.dt.year.head() <br>
start.dt.day.head(3) <br>
start.dt.minute.head(3) <br>
start.dt.dayofweek.head() (0 = Monday, 6 = Sunday)

There are also boolean attributes for whether the datetime falls at the start or end of a month, quarter, or year

start.dt.is_month_start <br>
start.dt.is_quarter_end
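
A sketch of the `.dt` attributes on three invented timestamps, chosen so that one falls on a month start and one on a quarter end.

```python
import pandas as pd

start = pd.Series(pd.to_datetime([
    '2013-06-28 19:01:00',
    '2014-01-01 00:00:00',
    '2014-03-31 08:15:00',
]))

years = start.dt.year
weekdays = start.dt.dayofweek          # 0 = Monday, 6 = Sunday
month_start = start.dt.is_month_start  # True only for 2014-01-01
quarter_end = start.dt.is_quarter_end  # True only for 2014-03-31
print(weekdays.tolist())
```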


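##### methods
* `ceil`, `round`, `floor` - round each datetime up, to the nearest, or down at the given frequency
* `strftime` - formats each datetime as a string, using a format string
* `to_period` - converts each datetime to a period of the given frequency

`ceil`, `round`, `floor`, and `to_period` require an offset alias (H, min, S, D, W). The rounding methods accept only fixed frequencies, so W works with `to_period` but not with `floor`, `ceil`, or `round`.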

start.dt.strftime('On %A, %B %d, %Y at %X something great happened')

_Example output for one value_ <br>
On Friday, June 28, 2013 at 19:01:00 something great happened
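
A sketch of the `.dt` methods on two invented timestamps. It uses the `15min` and `M` aliases because, to my knowledge, recent pandas releases deprecate the uppercase `H` and `S` aliases in favor of lowercase `h` and `s`, while `min` is spelled the same across versions.

```python
import pandas as pd

start = pd.Series(pd.to_datetime(['2013-06-28 19:01:00', '2013-06-28 19:47:30']))

floored = start.dt.floor('15min')  # down to the previous quarter hour
ceiled = start.dt.ceil('15min')    # up to the next quarter hour
nearest = start.dt.round('15min')  # to the closest quarter hour
months = start.dt.to_period('M')   # monthly periods
text = start.dt.strftime('%A, %B %d, %Y')  # day names follow the locale
print(text.iloc[0])
```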

### Dataframe missing values

`df.dropna` can take a `subset` list of columns, or a `thresh` threshold for the minimum number of non-missing values a row needs in order to be kept (i.e. `thresh=5` drops rows that have fewer than 5 non-null values). <br>
It also has an `axis` param to drop columns instead of rows.

`df.fillna` can take an individual value, a dictionary mapping col name to fill value, or a Series such as the column means or medians. Forward and backward filling are available as `df.ffill()` and `df.bfill()`; the older `method='ffill'` argument to `fillna` is deprecated as of pandas 2.1. <br>
    `med = df.median(numeric_only = True)` <br>
    `df.fillna(med)`
    
    `df.ffill()`
    
`df.interpolate` fills missing values using the specified interpolation method, which defaults to linear
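
A sketch of the missing-value tools on an invented three-column frame. It uses `df.ffill()` for forward filling, which avoids the `method` argument of `fillna` that newer pandas versions deprecate.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': [10.0, 20.0, np.nan, 40.0],
    'c': [np.nan, np.nan, np.nan, 4.0],
})

kept = df.dropna(thresh=2)           # rows with at least 2 non-null values
no_a_gaps = df.dropna(subset=['a'])  # only column 'a' decides what is dropped
med = df.median(numeric_only=True)
filled_med = df.fillna(med)          # Series maps column name -> fill value
filled_fwd = df.ffill()              # carry the last valid value forward
interp = df.interpolate()            # linear by default
print(interp)
```

Forward filling cannot fill a gap at the very top of a column, which is why column `c` keeps its leading missing values after `ffill`.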