# Python basics

In [2]:
x = [1, 2, 3]
# Handy ways to copy a list:
y = list(x)
y = x[:]

In [3]:
# Handy list methods:
x.append(2)
x.count(2)

2

# Numpy arrays

- `numpy.array()` objects are especially useful for *element-wise* calculations.
- Slicing: `my_array[rows, columns]`
- All elements must be the same data type.
- The `+` operator does an *element-wise sum* on numpy arrays instead of concatenating as it does with lists.
- `numpy.corrcoef()` docs: https://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.corrcoef.html
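
A minimal sketch of the points above, using made-up values: `+` concatenates lists but adds arrays element-wise, 2D arrays are sliced as `[rows, columns]`, and mixed inputs are coerced to one dtype.

```python
import numpy as np

# Plain lists: + concatenates
list_sum = [1, 2] + [3, 4]

# numpy arrays: + adds element-wise
array_sum = np.array([1, 2]) + np.array([3, 4])

# 2D slicing: my_array[rows, columns]
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])
first_row_tail = grid[0, 1:]   # row 0, columns 1 onward
second_column = grid[:, 1]     # all rows, column 1

# Mixed inputs are coerced to a single dtype (here float64)
mixed = np.array([1, 2.5, True])

print(list_sum)         # [1, 2, 3, 4]
print(array_sum)        # [4 6]
print(first_row_tail)   # [2 3]
print(second_column)    # [2 5]
print(mixed.dtype)      # float64
```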

In [4]:
import numpy as np
# Creates a *boolean mask* array
np_x = np.array(x)
gt_1 = np_x > 1
# Which can be used to slice another ndarray
print(np_x[gt_1])

[2 3 2]


In [5]:
# Interesting *type coercion* behavior for np arrays: booleans are converted to 0/1 and 
# element-wise operations are applied as usual.
np.array([True, 1, 2]) + np.array([3, 4, False])

array([4, 5, 2])

## array boolean operators for numpy and pandas
- `logical_and()`
- `logical_or()`
- `logical_not()`

In [6]:
np.logical_and(np_x > 2, np_x <= 3)

array([False, False,  True, False], dtype=bool)

In [7]:
# Selecting from np.array
np_x[np.logical_and(np_x > 2, np_x <= 3)]


array([3])

# pandas


## Index slicing methods:
- Full documentation here: [Indexing and Selecting Data](https://pandas.pydata.org/pandas-docs/stable/indexing.html)
- Brackets only (for convenience but *not recommended*).  Aka Python's `__getitem__`.
  - `df[]`
  - if *column label* is passed, returns *columns* as Series.
  - if *index slice using colon* is passed, returns *rows* as DataFrame.
- Accessors (*preferred* for consistency and internal optimization):
    - `.loc['row_selector', 'col_selector']` - *label*-based
      - Gotcha watch: Label *ranges* can be passed with ':' as well.  Careful not to let this get confusing.
    - `.iloc[row_selector, col_selector]` - *position*-based
    - `.ix[]` - combines label *and* position-based selection (deprecated since pandas 0.20 and removed in 1.0; prefer `.loc`/`.iloc`)
    - Selectors can be passed as *lists* as well.
- As attributes:
  - You may access an index on a Series or column on a DataFrame directly as an attribute:
  - `series.AAPL`
  - `df.price`
- Gotcha watch: When slicing:
    - *One* pair of brackets returns a **Series** - `df1['selector']`
    - *Two* pairs of brackets returns a **DataFrame** - `df1[['selector']]`
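
A small sketch of the label-range gotcha, assuming a toy DataFrame whose labels and values are made up for illustration: a `.loc` slice *includes* its end label, while an `.iloc` slice excludes its end position.

```python
import pandas as pd

# Hypothetical toy frame (not from the course data)
df = pd.DataFrame(
    {"price": [10.0, 20.0, 30.0], "volume": [100, 200, 300]},
    index=["AAPL", "CSCO", "MSFT"],
)

# Label-based: the end label IS included
by_label = df.loc["AAPL":"CSCO", "price"]

# Position-based: the end position is NOT included
by_position = df.iloc[0:2, 0]

# Lists of selectors return a DataFrame, even for one row and one column
as_frame = df.loc[["AAPL"], ["price"]]

print(by_label.tolist())         # [10.0, 20.0]
print(by_position.tolist())      # [10.0, 20.0]
print(type(as_frame).__name__)   # DataFrame
```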

In [8]:
# Notes setup
# Use a name other than `dict` so the built-in dict() is not shadowed
brics_data = {
    "country": ["Brazil", "Russia", "India", "China", "South Africa"],
    "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
    "area": [8.516, 17.10, 3.286, 9.597, 1.221],
    "population": [200.4, 143.5, 1252, 1357, 52.98]
}

import pandas as pd
brics = pd.DataFrame(brics_data)
brics.index = ["BR", "RU", "IN", "CH", "SA"]
print(brics)

      area    capital       country  population
BR   8.516   Brasilia        Brazil      200.40
RU  17.100     Moscow        Russia      143.50
IN   3.286  New Delhi         India     1252.00
CH   9.597    Beijing         China     1357.00
SA   1.221   Pretoria  South Africa       52.98


In [9]:
# Select column as Series - Use ONE pair of brackets
print(brics["country"])
# Select column as DataFrame - Use TWO pairs of brackets
print(brics[["country"]])

BR          Brazil
RU          Russia
IN           India
CH           China
SA    South Africa
Name: country, dtype: object
         country
BR        Brazil
RU        Russia
IN         India
CH         China
SA  South Africa


In [10]:
# With bare brackets, a numeric slice selects ROWS (by position)
print(brics[1:3])

      area    capital country  population
RU  17.100     Moscow  Russia       143.5
IN   3.286  New Delhi   India      1252.0


In [11]:
# Using loc: Select rows by index name **as Series** - use one pair of brackets
print(brics.loc["RU"])
# Using loc: Select rows by index name **as DataFrame** - use TWO pairs of brackets
print(brics.loc[["RU"]])

area            17.1
capital       Moscow
country       Russia
population     143.5
Name: RU, dtype: object
    area capital country  population
RU  17.1  Moscow  Russia       143.5


In [12]:
# Add new calculated column with .apply()
brics['name_length'] = brics['country'].apply(len)
print(brics)

      area    capital       country  population  name_length
BR   8.516   Brasilia        Brazil      200.40            6
RU  17.100     Moscow        Russia      143.50            6
IN   3.286  New Delhi         India     1252.00            5
CH   9.597    Beijing         China     1357.00            5
SA   1.221   Pretoria  South Africa       52.98           12


# Iterating

Dictionaries:
- `for key, val in my_dict.items():`

Numpy arrays:
- `for val in np.nditer(np_x):`

Dataframes:
- `for label, row_series in df.iterrows():`

enumerate()
- **`enumerate()`** returns an enumerate object that produces a sequence of tuples which are **index-value pairs** (vs. not having an index available in *iterator* objects)
  - `for index1, value1 in enumerate(mutants):`

zip()
- Combines any number of iterables into a zip object which is an iterator of tuples
- Can be printed with a splat: 
  - `print(*zipped)`
- You can "unzip" zipped list by splat-unpacking into positional args then zip'ing back up:
  - `orig_list1, orig_list2 = zip(*zipped)`
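
A short sketch of `enumerate()` and `zip()` with made-up lists. One assumption worth flagging: the zip object is materialized with `list()` first, because a zip iterator is exhausted after a single pass and could not be both printed and unzipped.

```python
names = ["a", "b", "c"]
scores = [1, 2, 3]

# enumerate() yields (index, value) pairs; start= shifts the first index
indexed = list(enumerate(names, start=1))

# zip() pairs items up; list() keeps the pairs reusable
zipped = list(zip(names, scores))
print(*zipped)   # ('a', 1) ('b', 2) ('c', 3)

# "Unzip" by splat-unpacking the pairs back into zip()
orig_names, orig_scores = zip(*zipped)

print(indexed)       # [(1, 'a'), (2, 'b'), (3, 'c')]
print(orig_names)    # ('a', 'b', 'c')
print(orig_scores)   # (1, 2, 3)
```

Note that unzipping returns tuples, not the original lists.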

## List comprehensions
- Syntax:
  - `[*output expression* for *iterator variable* in *iterable*]`
- Conditionals can be included in the **output expression** (to modify output; an `else` branch is required) or at the **end** (to filter output):
  - `[*output expression* if *condition* else *alternative* for *iterator variable* in *iterable*]`
  - `[*output expression* for *iterator variable* in *iterable* if *conditional on iterable*]`
- Can be **nested** with this syntax ([output expression] is itself a L.C.):
  - `[[*output expression*] for *iterator variable* in *iterable*]`
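
The placements of the conditional can be sketched as follows (the variable names are illustrative only):

```python
numbers = [0, 1, 2, 3, 4, 5]

# Conditional at the END filters which items are kept
evens = [n for n in numbers if n % 2 == 0]

# Conditional in the OUTPUT EXPRESSION keeps every item but changes its value
labels = ["even" if n % 2 == 0 else "odd" for n in numbers]

# Nested: the output expression is itself a list comprehension
pairs = [[(row, col) for col in range(2)] for row in range(2)]

print(evens)    # [0, 2, 4]
print(labels)   # ['even', 'odd', 'even', 'odd', 'even', 'odd']
print(pairs)    # [[(0, 0), (0, 1)], [(1, 0), (1, 1)]]
```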

In [13]:
matrix = [[col for col in range(0, 5)] for row in range(0, 5)]
print(matrix)

[[0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4]]


## Generators
Two ways to define generators:
1. List comprehension-like syntax:
   - Same syntax as list comprehensions except *parens ()* instead of brackets [].
   - Like a list comprehension except puts the output in an iterable **generator** object instead of an in-memory list.  "Lazy evaluation".  Handy for very large data sets which don't easily fit in memory.
2. Generator functions: 
   - produce an iterable **generator** object: `yield` instead of `return`
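
A minimal sketch of both ways to build a generator. `count_up_to` is a hypothetical helper written for this example.

```python
import sys

# Same expression, two containers: an in-memory list vs. a lazy generator
squares_list = [n * n for n in range(1000)]
squares_gen = (n * n for n in range(1000))

# The generator object stays small because values are produced on demand
print(sys.getsizeof(squares_gen) < sys.getsizeof(squares_list))   # True

def count_up_to(limit):
    """Yield the integers 1..limit one at a time."""
    n = 1
    while n <= limit:
        yield n
        n += 1

first_three = [next(squares_gen) for _ in range(3)]
total = sum(count_up_to(4))

print(first_three)   # [0, 1, 4]
print(total)         # 10
```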

# Scopes
- "LEGB" order rule: Local, Enclosing, Global, Built-ins

- `global`: lets a function rebind a name defined at module level (just *reading* a global needs no keyword)
- `nonlocal`: lets a nested function rebind a name in the enclosing function's scope
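
A small sketch of both keywords, meant to be run at module level. All names are made up for illustration.

```python
counter = 0   # module-level (global) name

def increment_global():
    global counter   # rebind the module-level name instead of creating a local
    counter += 1

def make_counter():
    count = 0   # lives in the enclosing scope of step()

    def step():
        nonlocal count   # rebind the enclosing function's variable
        count += 1
        return count

    return step

increment_global()
step = make_counter()
first = step()
second = step()

print(counter)         # 1
print(first, second)   # 1 2
```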

## Closure example

In [14]:
def raise_val(n):
    """Return the inner function."""
    
    def inner(x):
        """Raise x to the power of n."""
        raised = x ** n
        return raised

    return inner

square = raise_val(2) 
cube = raise_val(3)

print(square(4))
print(cube(5))

16
125


# Function tips
- `*args` gets turned into a tuple
- `**kwargs` gets turned into a dict
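
For instance, with a hypothetical function `describe`:

```python
def describe(*args, **kwargs):
    """Return the collected positional and keyword arguments."""
    return args, kwargs

args, kwargs = describe(1, 2, unit="cm")

print(type(args).__name__, args)       # tuple (1, 2)
print(type(kwargs).__name__, kwargs)   # dict {'unit': 'cm'}
```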

# Importing data

## Numpy import methods:
- `np.loadtxt(filename, delimiter=',', skiprows=1, usecols=[0, 2], dtype=str)`
- `np.genfromtxt(filename, delimiter=',', names=True, dtype=None)`
- `np.recfromcsv()` -- same as genfromtxt() but with delimiter=',', names=True and dtype=None by default

## pandas import methods:
- `pd.read_csv(filename)`
- `sheets = pd.read_excel(filename, sheet_name=None)`
  - Note that with `sheet_name=None` the output of pd.read_excel() is a Python *dictionary* with sheet names as keys and the corresponding DataFrames as values. By default (`sheet_name=0`) it returns only the first sheet as a single DataFrame.
- `xl = pd.ExcelFile(filename)`
  - Get sheet names: `xl.sheet_names`
  - Then `df = xl.parse("sheetname")`
  - Or by index `df = xl.parse(0)`

**pickled files:**
- `with open("filename.pkl", 'rb') as file: data = pickle.load(file)` (after `import pickle`)

**SAS files:**
```
from sas7bdat import SAS7BDAT
with SAS7BDAT(filename) as file:
    df_sas = file.to_data_frame()
```
**Stata files:**
- `df = pd.read_stata(filename)`

**HDF5 files:**
- "Hierarchical Data Format" - Good for large amounts of *numerical* data
- `import h5py`
- `data = h5py.File(file, 'r')`
- Print *groups* in the file:
```
for key in data.keys():
    print(key)
```
- Then get value of data within groups:
  - `strain = data['strain']['Strain'][()]` (older h5py versions used `.value`, which was removed in h5py 3.0)
  
**Matlab files:**
- `scipy.io.loadmat(filename)`
- `scipy.io.savemat(filename, mdict)` -- `mdict` is a dictionary of variable names and arrays to save

**Relational databases:**
- With SQLAlchemy:
```
from sqlalchemy import create_engine, text
engine = create_engine('sqlite:///Northwind.sqlite')
with engine.connect() as con:
    # Wrapping the SQL string in text() is required by SQLAlchemy 2.x and also works in 1.4
    rs = con.execute(text('select * from ...'))
    df = pd.DataFrame(rs.fetchall())
    df.columns = rs.keys()
```
- Or super concise with pandas:
  - `df = pd.read_sql_query("SELECT * FROM Orders", engine)`
  
**From web directly:**
- With pandas:
  - `df = pd.read_csv(url, sep='\t')`
- With urllib:
```
from urllib.request import urlopen, Request
url = "https://www.wikipedia.org/"
request = Request(url)
response = urlopen(request)
html = response.read()
response.close()
```
- With Requests module:

In [15]:
import requests
url = "http://www.datacamp.com/teach/documentation"
r = requests.get(url)
text = r.text
#print(text)

# JSON output:
url = 'http://www.omdbapi.com/?apikey=ff21610b&t=social+network'
r = requests.get(url)
json_data = r.json()
print(json_data.keys())

dict_keys(['Response', 'Error'])


### BeautifulSoup formatting:

In [16]:
html_doc = text
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "lxml")
pretty_soup = soup.prettify()
html_text = soup.get_text()

# Data Cleaning & Transformation with pandas
(Need to increase **`pd.options.display.max_columns = 20`** for iPython in this DataCamp course for whatever reason.)

"A *matrix* has rows & columns.  A *DataFrame* has observations and variables." -- Hadley Wickham

Exploratory data analysis techniques:
- Frequency counts
  - `df.info()`
  - `value_counts(dropna=False)` - Also sorts by default, so good for rankings.
  - `unique()`
- Summary statistics
  - `df.describe()`
- Visualize the data:
  - boxplot for summary stats
  - bars for discrete
  - histograms for continuous
  - scatter for comparing numerical vars
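- Note: When checking a condition over an entire DataFrame (e.g. `pd.notnull(df)`), and not just one column or row, you'll have to chain the `.all()` method *twice*: the first call reduces each column to one boolean, the second reduces those to a single boolean.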
  
## [Tidy Data paper by Hadley Wickham](https://www.jstatsoft.org/article/view/v059i10/v59i10.pdf)
"Tidy data" is a standard way of mapping the meaning of a dataset to its structure.  In tidy data:
1. Each variable (attribute) forms a column.
2. Each observation forms a row.
3. Each type of observational unit (entity) forms a table.
(Same as Codd’s 3rd normal form!)

Key definitions (framed in a language familiar to statisticians):
- **data set**: a collection of values
- **values** are organized in two ways:
    - **variable**: contains all values measuring an *attribute* (like height, temp, duration) *across units*.
    - **observation**: contains all values measured on the same *unit* ("entity" in relational DB terms) (like a person, a day, a race) *across attributes*. A measurement.


## Key exploratory methods
- head()
- info()
- columns
- dtypes
- describe()
- df.column.value_counts()
- df.column.plot(kind='hist')


## Pivoting
A pivot table allows you to see all of your variables as a function of two other variables; one indexed as rows and one indexed as columns.
- Explained in other ways, pivoting:
  - Turns rows of data into columns.
  - Creates a new column for each unique value in a specified column.
  - Turns data from an "analysis-friendly" shape into a "reporting-friendly" shape (which violates "Tidy Data" principle of rows containing observations)
- The `index` param specifies columns NOT to pivot; they are used as row indexes instead.
- `margins=True` param (in `pivot_table()`) adds grand totals as an `All` row and column.
- Use `df.reset_index()` to move the row index of the pivoted DataFrame back into regular columns

**`pivot()` example**:
```
df_pivoted = df.pivot(index='date', columns='element', values='value') 
```
**`pivot_table()` example** -- A generalization of `pivot()` that can handle duplicate values for one index/column pair by aggregating them. The `aggfunc` param (mean by default) decides how duplicates are combined:
- `index` and `columns` params can be lists in this function
```
df_pivoted = df.pivot_table(index='date', columns='element', values='value', aggfunc=np.mean) 
```


## Melting
- Turns set of columns of data into rows as a single column.  Explained in other ways, the goal of melting is to:
  - Change a DataFrame from a wide shape to a long shape.
  - Restore a *pivoted* DataFrame to its original form.
- Important params:
  - **id_vars** (*tuple, list, or ndarray; optional*): explicitly specifies *columns* that should remain in the reshaped DataFrame.
  - **value_vars** (*tuple, list, or ndarray; optional*): columns to convert into values ("unpivot").
  - **var_name**: Name to use for the ‘variable’ column. 
  - **value_name**: Name to use for the ‘value’ column.
  - **col_level**: If you have a multi-level *column* index, specify which level of it to melt.
```
pd.melt(frame=df, id_vars='name', 
    value_vars=['treatment a', 'treatment b'], 
    var_name='treatment', value_name='result') 
```
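
A round-trip sketch of pivoting and melting, assuming a tiny made-up weather table with the same column names as the snippets above (`date`, `element`, `value`):

```python
import pandas as pd

# Hypothetical long ("tidy") data: one row per date/element pair
long_df = pd.DataFrame({
    "date": ["2010-01-01", "2010-01-01", "2010-01-02", "2010-01-02"],
    "element": ["tmax", "tmin", "tmax", "tmin"],
    "value": [27.8, 14.5, 27.3, 14.4],
})

# Long -> wide: one column per unique value of 'element'
wide = long_df.pivot(index="date", columns="element", values="value")

# Wide -> long: reset_index() turns 'date' back into a column before melting
melted = wide.reset_index().melt(
    id_vars="date",
    value_vars=["tmax", "tmin"],
    var_name="element",
    value_name="value",
)

print(wide)
print(melted.sort_values(["date", "element"]))
```

After sorting, the melted rows match the original long table; the row order right after `melt()` follows `value_vars`, not the original order.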

## Concatenating
- `pd.concat([df1, df2, df3])`
- `axis=0` by default

## Merging
- `pd.merge(left=df1, right=df2, on=common_key_name)`
- If key names in DFs are different: `pd.merge(left=df1, right=df2, left_on='key1', right_on='key2')`
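
A minimal sketch contrasting the two operations, with made-up frames. It relies on `pd.merge()` performing an inner join by default.

```python
import pandas as pd

# Hypothetical frames whose key columns have different names
df1 = pd.DataFrame({"key1": [1, 2], "name": ["a", "b"]})
df2 = pd.DataFrame({"key2": [2, 3], "score": [0.5, 0.9]})

# concat stacks rows (axis=0); ignore_index=True renumbers the result
stacked = pd.concat([df1, df1], ignore_index=True)

# merge joins on key values; only key 2 appears in both frames
merged = pd.merge(left=df1, right=df2, left_on="key1", right_on="key2")

print(len(stacked))   # 4
print(merged)
```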

## Data types
- `df.dtypes` to see data types [note it's an *attribute*]
- Converting types:
  - `df['converted_col'] = df['source_col'].astype(str)`
  - `df['converted_col'] = pd.to_numeric(df['source_col'], errors='coerce')`
  
## Missing data
- `df1.drop_duplicates()`
- `df1.dropna()`
- `df1['sparse_column'].fillna(value)` -- a fill value (or fill method) must be supplied
- `df1['sparse_column'].fillna(mean_value)`
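
The type conversion and missing-data steps fit together naturally; a sketch with a made-up column containing one unparseable entry:

```python
import pandas as pd

# Hypothetical column read in as strings, with one bad value
df = pd.DataFrame({"source_col": ["1", "2", "oops", "4"]})

# errors='coerce' turns values that cannot be parsed into NaN
df["converted_col"] = pd.to_numeric(df["source_col"], errors="coerce")

# Option 1: drop rows that contain NaN
dropped = df.dropna()

# Option 2: fill NaN with the column mean (NaN is ignored when computing it)
mean_value = df["converted_col"].mean()
filled = df["converted_col"].fillna(mean_value)

print(df["converted_col"].isna().sum())   # 1
print(len(dropped))                       # 3
print(filled.tolist())
```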

## Regular expressions
- My jam.
- In Python, you can compile the regex first with `re.compile()` for speed when it is reused

- Useful methods:
```
Series.str.contains(pattern)
```
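
A brief sketch of compiling a pattern once and reusing it, plus the pandas string method above. The sample strings are made up.

```python
import re
import pandas as pd

# Compile once, reuse many times (raw string avoids escaping backslashes)
phone = re.compile(r"\d{3}-\d{3}-\d{4}")

print(bool(phone.match("123-456-7890")))                    # True
print(phone.findall("call 123-456-7890 or 987-654-3210"))   # ['123-456-7890', '987-654-3210']

# Series.str.contains() applies a regex to every element of a Series
contacts = pd.Series(["123-456-7890", "n/a"])
has_phone = contacts.str.contains(phone.pattern)
print(has_phone.tolist())                                   # [True, False]
```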

In [17]:
import re
# Raw strings (r'...') keep backslashes such as \d from being treated as string escapes
result = re.match(pattern=r'\d{3}-\d{3}-\d{4}', string='123-456-7890')
print(type(result))
print(bool(result))
print(re.findall(pattern=r'\d+', string='5 strawberries and 16 bananas'))

<class '_sre.SRE_Match'>
True
['5', '16']


# [Course: pandas Foundations]
(Need to increase `pd.options.display.max_columns = 10` for iPython in this DataCamp course.)

## pandas Data Structures:
- An **Index** is a sequence of labels.  Simple.
  - Has a homogeneous data type
  - Immutable (like dict keys)
  - *name* attribute
- A **Series** is a 1D array with an index.
  - The *values* of a Series are of type `numpy.ndarray` and can be returned with `series.values`.
- A **DataFrame** is a 2D labelled array whose columns are Series.  They share a common Index.
  - A *column* in a DataFrame is a Series. 
  
- The set of *columns* can also be named: `df.columns.name = 'Vegetables'`
  - This is equivalent to naming a "column index" just like a regular row index can be named.
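
A quick sketch that checks these properties on a made-up Series. The names `ticker` and `fields` are illustrative only.

```python
import numpy as np
import pandas as pd

prices = pd.Series([10.5, 20.0, 30.25], index=["AAPL", "CSCO", "MSFT"])
prices.index.name = "ticker"

# The values of a Series are a numpy array
print(type(prices.values).__name__)   # ndarray

# An Index is immutable: item assignment raises TypeError
try:
    prices.index[0] = "GOOG"
    index_is_mutable = True
except TypeError:
    index_is_mutable = False
print(index_is_mutable)               # False

# A DataFrame column is a Series sharing the DataFrame's index
df = pd.DataFrame({"price": prices})
df.columns.name = "fields"
print(type(df["price"]).__name__)     # Series
```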

## Rolling DataFrames from scratch
- Can be composed from a dictionary of key-value pairs (*keys* are column headers, *values* are lists of column values).
- Example:

In [18]:
import pandas as pd

# Prepare two lists
list_keys = ('Country', 'Total')
list_values = (['United States', 'Soviet Union', 'United Kingdom'], [1118, 473, 273])

# Zip the 2 lists together into one list of (key,value) tuples
zipped = list(zip(list_keys, list_values))
print(zipped)

# Build a dictionary with the zipped list
data = dict(zipped)

# Build and inspect a DataFrame from the dictionary
df = pd.DataFrame(data)
print(df)

[('Country', ['United States', 'Soviet Union', 'United Kingdom']), ('Total', [1118, 473, 273])]


          Country  Total
0   United States   1118
1    Soviet Union    473
2  United Kingdom    273

## Broadcasting

- Can be used when creating *new* DFs as well as modifying existing DFs
- Example:

In [None]:
cities = ['Manheim', 'Preston park', 'Biglerville', 'Indiana', 'Curwensville', 
          'Crown', 'Harveys lake', 'Mineral springs', 'Cassville', 'Hannastown', 
          'Saltsburg', 'Tunkhannock', 'Pittsburgh', 'Lemasters', 'Great bend']

# Construct a dictionary: data
data = {'state': 'PA', 'city': cities}

# Construct a DataFrame from dictionary data: df
df = pd.DataFrame(data)
print(df)

## Useful read_csv() params
- **header**: row number (0-indexed) to use as the column names, or `None` if the file has no header row
- **names**: column labels
- **index_col**
- **na_values**: values to consider as empty/NaN.  Can be a scalar, list, or dictionary (to specify which columns to apply this to).
- **parse_dates**: list of lists with column positions to treat as dates.  Each list is combined into one date field.
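
A sketch of several of these params working together, reading from an in-memory string instead of a file. The data and column names are made up, and `-1` is assumed to be a missing-value sentinel in the `flag` column only.

```python
import io
import pandas as pd

# Hypothetical CSV content with no header row
raw = "2010-01-01,5,-1\n2010-01-02,7,3\n"

df = pd.read_csv(
    io.StringIO(raw),
    header=None,                      # the file has no header row
    names=["date", "temp", "flag"],   # so supply the column labels
    index_col="date",                 # use the 'date' column as the index
    na_values={"flag": [-1]},         # treat -1 as NaN, in 'flag' only
    parse_dates=True,                 # parse the index as datetimes
)

print(df)
print(type(df.index).__name__)   # DatetimeIndex
```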


# Visual EDA

- Three diff plotting idioms with DataFrames (check documentation)!
  1. `df.plot(kind='hist')`
  1. `df.plot.hist()`
  1. `df.hist()`
- Useful *histogram* plot() options:
  - **bins**: number of bins
  - **range** (tuple): min and max ("extrema") of bins
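  - **normed** (boolean): whether to normalize to one (aka Probability Density Function (PDF)). Renamed **density** in newer matplotlib versions, where `normed` has been removed.
  - **cumulative** (boolean): whether to compute Cumulative Distribution Function (CDF). Combine with normed=True so the curve tops out at 1.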
  
# Key Statistical EDA Methods

- count()
- mean()
- std()
- median()
- quantile(q) -- Aka "percentiles". Computes median by default.
- min(), max()
- unique() -- Distinct categories
- nunique() -- *Number* of distinct categories
- idxmin(), idxmax()

Note all DF statistical methods *ignore null entries*.

# Time Series

## Indexing topics:
- An Index is a special type of Series.
- `DatetimeIndex`
- `df1.set_index('Date column name', inplace=True)`
- `pd.to_datetime()`
- Partial string indexing and slicing using datetimes.
  - Example showing datetime slice and the columns to include: `df1.loc['2010-Aug', 'Temperature']`
- Reindexing: `df1.reindex(index)`
  - `method='ffill'`
  - `method='bfill'`
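
The indexing steps above can be sketched with a made-up temperature table; the column names follow the examples in the list.

```python
import pandas as pd

# Hypothetical readings; dates start out as plain strings
df1 = pd.DataFrame({
    "Date": ["2010-07-31", "2010-08-01", "2010-08-15", "2010-09-01"],
    "Temperature": [70.0, 75.0, 80.0, 68.0],
})

# Convert to datetimes and use them as the index
df1["Date"] = pd.to_datetime(df1["Date"])
df1.set_index("Date", inplace=True)

# Partial string indexing: every row in August 2010, one column
august = df1.loc["2010-08", "Temperature"]

# Reindex onto new dates; ffill carries the last known value forward
new_index = pd.to_datetime(["2010-08-01", "2010-08-02"])
filled = df1.reindex(new_index, method="ffill")

print(august.tolist())                  # [75.0, 80.0]
print(filled["Temperature"].tolist())   # [75.0, 75.0]
```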

## Resampling (downsampling, upsampling)
  - `df1.resample('freq string')`
  - Examples: 'D', '2W', '5A'
  - Common frequency strings (Full reference: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases):
  ```
'min', 'T' minute
'H'        hour
'D'        day
'B'        business day
'W'        week
'M'        month
'Q'        quarter
'A'        year
  ```
  - `df1.rolling(window=24).mean()`: Rolling mean, aka moving average.  **window=** specifies number of samples to aggregate.

## Datetime methods:
- `df1['datetime_column'].dt.hour`
- `df1['datetime_column'].dt.date`
- `df1['datetime_column'].dt.tz_localize('US/Pacific')`
- `df1['datetime_column'].dt.tz_convert('US/Central')`
- etc...

## Interpolation
Example using census data which is gathered once every decade:
- `population.resample('A').first().interpolate(method='linear')`
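
A scaled-down sketch of upsampling, interpolating, rolling and downsampling. It uses made-up readings taken every two days (rather than census data every decade) so the numbers are easy to check by hand.

```python
import pandas as pd

# Hypothetical series with one reading every other day
idx = pd.to_datetime(["2010-01-01", "2010-01-03", "2010-01-05"])
population = pd.Series([100.0, 120.0, 160.0], index=idx)

# Upsample to daily: days without a reading become NaN, then get interpolated
daily = population.resample("D").first().interpolate(method="linear")

# Rolling mean over a window of 2 samples (the first window is incomplete)
rolling = daily.rolling(window=2).mean()

# Downsample to 2-day bins, aggregating with the mean
two_day = daily.resample("2D").mean()

print(daily.tolist())     # [100.0, 110.0, 120.0, 140.0, 160.0]
print(two_day.tolist())   # [105.0, 130.0, 160.0]
```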

# [Course: Manipulating DataFrames with pandas]

## Filtering w/ boolean masks

Return a boolean mask:
- `df.column <conditional expression>`

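Filtering zero values:
- `df.all()` - returns a boolean Series that is True for columns whose values are *all non-zero*.  `df.loc[:, df.all()]` keeps only those columns, aka excludes cols containing zero values. (Gotcha watch!)
- `df.any()` - returns a boolean Series that is True for columns with *any non-zero* value.  `df.loc[:, df.any()]` keeps those columns, aka drops cols that are entirely zero. (Gotcha watch!)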

Filtering NaN values:
- `df.isnull()`
- `df.notnull()`
- Can be chained with `.all()` and `.any()`
- `df.dropna(how='any|all')`
  - Other options: `thresh=<minimum count of non-NaN values needed to keep>`, `axis='columns|rows'`
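
A compact sketch of these filters on a made-up DataFrame with a zero-only column (`z`) and a mostly-NaN column (`c`):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, 0, 3],
    "b": [4, 5, 6],
    "z": [0, 0, 0],
    "c": [np.nan, np.nan, 9.0],
})

# Boolean mask on one column, then use it to filter rows
mask = df["b"] > 4
filtered = df[mask]

# Column filters based on zero / non-zero values
all_nonzero = df.loc[:, df.all()]   # drops 'a' and 'z' (they contain zeros)
any_nonzero = df.loc[:, df.any()]   # drops only 'z' (entirely zero)

# NaN filters
cols_with_nan = df.columns[df.isnull().any()].tolist()
complete_rows = df.dropna(how="any")
dense_cols = df.dropna(axis="columns", thresh=2)   # keep cols with >= 2 non-NaN

print(all_nonzero.columns.tolist())   # ['b', 'c']
print(any_nonzero.columns.tolist())   # ['a', 'b', 'c']
print(cols_with_nan)                  # ['c']
print(dense_cols.columns.tolist())    # ['a', 'b', 'z']
```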
  
## Transforming DataFrames

Vectorized methods (work on DFs and Series):
  - `df.floordiv(12)`
  - Universal Functions ("ufunc")
    - `np.floor_divide(df, 12)`
    - NumPy's ufuncs operate on ndarrays in an element-by-element fashion.
  - Handy accessor: `str` (available on Series and Index objects)
    - `df['col_name'].str.lower()`
  - Arithmetic operators work element-wise too

Slower techniques (use Python for-loops internally):
- `df.apply(function)`
- `df.apply(lambda x: x//12)`
- An index is a special kind of Series
  - `.apply()` does NOT work on indexes -- use `map()` instead.
  - `df.index.map(str.lower)`
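
A sketch showing that the vectorized and `.apply()` routes give the same result on a made-up one-column DataFrame. Which one is faster on real data is not measured here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"inches": [12, 25, 40]}, index=["A", "B", "C"])

via_method = df.floordiv(12)               # DataFrame method
via_ufunc = np.floor_divide(df, 12)        # numpy universal function
via_operator = df // 12                    # arithmetic operator
via_apply = df.apply(lambda x: x // 12)    # slower, general-purpose route

# Indexes have no .apply(); use .map() instead
lower_index = df.index.map(str.lower)

print(via_method["inches"].tolist())   # [1, 2, 3]
print(lower_index.tolist())            # ['a', 'b', 'c']
```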
  
Delete a column with:
- `del df['col_name']`

## Hierarchical Indexes
Full documentation: [MultiIndex / Advanced Indexing](http://pandas.pydata.org/pandas-docs/stable/advanced.html)

- A **MultiIndex** contains multiple, possibly hierarchical index levels. Each label is a tuple with one value per level.
  - `stocks = stocks.set_index(['Symbol', 'Date'])`
  - [Gotcha watch:] `df.set_index()` *returns a new DataFrame* by default.
- Sort a MultiIndex to enable slicing by range:
  - `df.sort_index()`
- "Fancy" indexing is passing slice ranges as a *list*:
  - `stocks.loc[(['AAPL', 'MSFT'], '2016-10-05'), :]`
  - `stocks.loc[('CSCO', ['2016-10-05', '2016-10-03']), :] `
- [Gotcha watch:] Use `slice()` to force support for *both rows and columns* when needed (because the colon between ranges is not natively supported with tuples):
  - `stocks.loc[(slice(None), slice('2016-10-03', '2016-10-04')), :]`
  - `stocks.loc[(slice(None), '2016-10-05'), :]`
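
A runnable sketch of these selections, assuming a small `stocks` table with made-up closing prices (the real course data is not reproduced here):

```python
import pandas as pd

symbols = ["AAPL", "CSCO", "MSFT"]
dates = ["2016-10-03", "2016-10-04", "2016-10-05"]

# Hypothetical prices: three dates for each of three symbols
stocks = pd.DataFrame({
    "Symbol": [s for s in symbols for _ in dates],
    "Date": dates * 3,
    "Close": [112.5, 113.0, 113.1, 31.5, 31.4, 31.6, 57.4, 57.2, 57.6],
})

# set_index() returns a new DataFrame; sort so that range slicing works
stocks = stocks.set_index(["Symbol", "Date"]).sort_index()

# One tuple selects one row; adding a column label gives a scalar
one = stocks.loc[("CSCO", "2016-10-04"), "Close"]

# "Fancy" indexing: a list inside the tuple
fancy = stocks.loc[(["AAPL", "MSFT"], "2016-10-05"), :]

# slice(None) means "everything" on that level
date_range = stocks.loc[(slice(None), slice("2016-10-03", "2016-10-04")), :]
one_day = stocks.loc[(slice(None), "2016-10-05"), :]

print(one)                      # 31.4
print(fancy["Close"].tolist())  # [113.1, 57.6]
print(len(date_range))          # 6
print(len(one_day))             # 3
```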
  
## Stacking & unstacking DataFrames
Allows you fine control of pivoted DF where a MultiIndex is already present.

Unstack a key in a MultiIndex (moves a key from the index to a column):
- `trials.unstack(level='gender')` or `trials.unstack(level=1)`
- Note this results in hierarchical columns as *tuples*.

Stack a key in a MultiIndex (moves a column to the index):
- `trials.stack(level='gender')`

Swap order of keys with:
- `df_multiindexed.swaplevel(0, 1)`
- Then re-sort with `df_multiindexed.sort_index()`
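
A sketch of the three operations on a made-up `trials` table indexed by `id` and `gender`:

```python
import pandas as pd

# Hypothetical trial results with a two-level row index
trials = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "gender": ["F", "M", "F", "M"],
    "response": [5, 3, 8, 9],
}).set_index(["id", "gender"])

# Move 'gender' from the row index into the columns
wide = trials.unstack(level="gender")
print(wide.columns.tolist())   # [('response', 'F'), ('response', 'M')]

# Move it back from the columns into the row index
back = wide.stack(level="gender")

# Swap the order of the index levels, then re-sort
swapped = trials.swaplevel(0, 1).sort_index()
print(swapped["response"].tolist())   # [5, 8, 3, 9]
```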

## Groupby and Categoricals
- `df.groupby('column').count()` -- split into groups of rows by distinct values of given column then apply count() aggregation function to combine each group.
  - "split-apply-combine" paradigm
  - `groupby()` returns a `pandas.core.groupby.DataFrameGroupBy` object.
  - `DataFrameGroupBy.groups` is a *dictionary* (which can be handy to see the split groups).
- Can also choose *columns to apply the aggregation to*:
  - `df.groupby('column')['apply_col'].count()`
  - `sales.groupby('city')[['bread', 'butter']].sum()`

with `MultiIndex`:
- `gapminder.groupby(level=['Year', 'region'])`
  
The **`category`** data type can be handy; it reduces memory needed, and speeds up operations like `groupby()`:
- `df['column'] = df['column'].astype('category')` 

Multiple aggregations at once:
- `.agg(['sum', 'max', 'count', 'mean'])`
- `.agg(custom_function)`

You can also pass *dictionaries* to `agg()` to apply aggregations per column:
- `.agg({'bread':'sum', 'butter':custom_function})`
  
An **aggregation** (`.agg()`) does a *reduction*, while a **transformation** (`.transform()`) *applies a function element-wise to a sequence*.

An **`apply()`** is used for more complex transforms which don't fit into either of these paradigms.

Using **z-scores** is a great way to identify outliers:
- `standardized = gapminder_2010.groupby('region')[['life','fertility']].transform(zscore)` (select multiple columns with a *list*, i.e. double brackets)
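
A sketch contrasting aggregation with transformation on a made-up `sales` table. The local `zscore` helper is an assumption standing in for whichever implementation the course used (such as `scipy.stats.zscore`); it uses the population standard deviation (`ddof=0`).

```python
import pandas as pd

# Hypothetical sales data
sales = pd.DataFrame({
    "city": ["Austin", "Austin", "Dallas", "Dallas"],
    "bread": [10, 20, 30, 50],
    "butter": [1, 2, 3, 4],
})

# Aggregation: one row per group
totals = sales.groupby("city")[["bread", "butter"]].sum()
summary = sales.groupby("city")["bread"].agg(["sum", "max", "mean"])
per_column = sales.groupby("city").agg({"bread": "sum", "butter": "max"})

def zscore(x):
    """Standardize a group's values (hypothetical helper, population std)."""
    return (x - x.mean()) / x.std(ddof=0)

# Transformation: same number of rows as the input
standardized = sales.groupby("city")["bread"].transform(zscore)

print(totals)
print(summary)
print(standardized.tolist())
```

The aggregated results have two rows (one per city), while the transformed Series keeps all four original rows.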

## Groupby and Filtering
Filter a `groupby()` group using *iteration*:
```
# `splitting` is a GroupBy object, e.g. auto.groupby('yr')
chevy_means = {}
for group_name, group in splitting:
    chevy_mean = group.loc[group['name'].str.contains('chevrolet'), 'mpg'].mean()
    chevy_means[group_name] = chevy_mean
```

Or by using a *dictionary comprehension*:
```
chevy_means = {year: group.loc[group['name'].str.contains('chevrolet'), 'mpg'].mean() for year, group in splitting}
```

And you can do a comparison between all and a filtered group using a boolean mask:
```
chevy = auto['name'].str.contains('chevrolet')
auto.groupby(['yr', chevy])['mpg'].mean() 
```

`DataFrame.filter()` - Subset rows or columns of a dataframe according to *labels* in the specified index. Does NOT filter by *content*.
