# Part 1: familiarise with data tools

## Reading data using pandas

[**Pandas:**](http://pandas.pydata.org/) popular Python library for data exploration, manipulation, and analysis

- Anaconda users: use the provided [environment.yml](environment.yml)
- Other users: [installation instructions](https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html)

In [None]:
# conventional way to import pandas
import pandas as pd

In [None]:
# read CSV file from the 'data' subdirectory using a relative path
ads = pd.read_csv('data/Advertising.csv', index_col=0)

# display the first 5 rows
ads.head()

Primary object types:

- **DataFrame:** rows and columns (like a spreadsheet)
- **Series:** a single column
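
The difference between the two types shows up as soon as you select columns. A minimal sketch on a made-up toy table (the values below are illustrative, not loaded from `data/Advertising.csv`):

```python
import pandas as pd

# toy data, invented for illustration
toy = pd.DataFrame({"TV": [230.1, 44.5, 17.2], "sales": [22.1, 10.4, 9.3]})

column = toy["sales"]     # single brackets with one name -> Series
table = toy[["sales"]]    # a list of names -> DataFrame (here with one column)

print(type(column).__name__, column.shape)
print(type(table).__name__, table.shape)
```

A Series has a one-dimensional shape, while a DataFrame always has two dimensions, even with a single column.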

In [None]:
# display the last 5 rows
ads.tail()

In [None]:
# check the shape of the DataFrame (rows, columns)
ads.shape

## Visualizing data using seaborn

[**Seaborn:**](http://seaborn.pydata.org/) Python library for statistical data visualization built on top of Matplotlib

- Anaconda users: use the provided [environment.yml](environment.yml)
- Other users: [installation instructions](http://seaborn.pydata.org/installing.html)

In [None]:
# conventional way to import seaborn
import seaborn as sns

# allow plots to appear within the notebook
%matplotlib inline

View the relationship between the features and the response using [scatterplots](http://seaborn.pydata.org/generated/seaborn.pairplot.html). Passing `kind='reg'` fits a linear regression model to each scatter plot.

In [None]:
# visualize the relationship between the features and the response using scatterplots
sns.pairplot(ads, x_vars=['TV','radio','newspaper'], y_vars=['sales'], height=5, aspect=0.7, kind='reg')

## Using Pandas to manage datasets

To explore pandas we will use the [Titanic dataset](https://www.kaggle.com/c/titanic), available in the folder `data/titanic`.

In [None]:
titanic = pd.read_csv("data/titanic/train.csv")

You can use the `loc` (label-based) and `iloc` (position-based) indexers to [select elements](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#different-choices-for-indexing) of a `Series` or a `DataFrame`.

In [None]:
# select rows
titanic.loc[:10]
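
Because `loc` slices by label, both endpoints are included, so `titanic.loc[:10]` returns 11 rows when the index is the default 0, 1, 2, ... With `iloc`, the usual Python rule applies and the end position is excluded. A small sketch on a hypothetical frame whose index labels differ from the positions:

```python
import pandas as pd

# hypothetical toy frame; the index starts at 100 so labels != positions
df = pd.DataFrame({"x": [10, 20, 30, 40]}, index=[100, 101, 102, 103])

by_label = df.loc[100:102]    # label slice: both endpoints included
by_position = df.iloc[0:2]    # position slice: end excluded

print(by_label)
print(by_position)
```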

#### About the data
Here are some of the columns:
* Name - a string with person's full name
* Survived - 1 if a person survived the shipwreck, 0 otherwise.
* Pclass - passenger class.
* Sex - the passenger's sex as recorded in the dataset (`male` or `female`)
* Age - age in years, if available
* SibSp - number of siblings or spouses aboard
* Parch - number of parents or children aboard
* Fare - ticket cost
* Embarked - port where the passenger embarked
 * C = Cherbourg; Q = Queenstown; S = Southampton

In [None]:
# table dimensions
print("len(titanic) = ", len(titanic))
print("titanic.shape = ", titanic.shape)

In [None]:
# select a single column.
titanic["Age"].loc[:10] # alternatively: titanic.Age

In [None]:
# select several columns and rows at once
titanic[["Fare","Pclass"]].loc[5:10]

In [None]:
# select passengers of rows 13 and 666 - did they survive?

# <YOUR CODE >

Pandas provides different [aggregate functions](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#descriptive-statistics).
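
As a rough illustration of how these aggregates are called, here is a sketch on a made-up table (the column names and values are invented for this example):

```python
import pandas as pd

# invented values, for illustration only
scores = pd.DataFrame({"attempts": [1, 3, 2, 2], "points": [7.0, 3.0, 9.0, 5.0]})

mean_points = scores["points"].mean()    # aggregate over a single column
column_sums = scores.sum()               # one sum per column, returned as a Series
summary = scores.agg(["min", "max"])     # several aggregates at once

print(mean_points)
print(column_sums)
print(summary)
```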

In [None]:
# compute the overall survival rate (what fraction of passengers survived the shipwreck)

# <YOUR CODE >

Pandas also has some basic data analysis tools. For one, you can quickly display summary statistics for each numeric column using `.describe()`.

In [None]:
titanic.describe()

Some columns contain __NaN__ values, which mark missing data. For example, the passenger in row `5` has an unknown age. To simplify later analysis, we'll replace NaN values using the pandas `fillna` method.

_Note: we do this so easily because it's a tutorial. In general, you should think twice before modifying data like this._

In [None]:
titanic.iloc[5]

In [None]:
titanic['Age'] = titanic['Age'].fillna(value=titanic['Age'].mean())
titanic['Fare'] = titanic['Fare'].fillna(value=titanic['Fare'].mean())

In [None]:
titanic.iloc[5]
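
The same mean imputation is easier to follow on a tiny example. The sketch below uses made-up ages; note that `fillna` returns a new Series, which is why the cell above assigns the result back to the column:

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, np.nan, 38.0, np.nan, 30.0])  # made-up ages

n_missing = ages.isna().sum()        # count the missing entries
filled = ages.fillna(ages.mean())    # mean() skips NaN by default

print("missing before:", n_missing)
print(filled)
```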

Pandas provides several functions for modifying a dataset. Some examples:

In [None]:
# convert using `map`
titanic['Sex'] = titanic['Sex'].map({'male': 0, 'female': 1})
titanic[["Name","Sex"]].loc[:5]

In [None]:
# cast a column
titanic['Age'] = titanic['Age'].astype(int)
titanic[['Name','Age']].loc[:5]

In [None]:
# Add a calculated column
titanic.loc[(titanic['Age'] >= 18), 'Underage'] = 0
titanic.loc[(titanic['Age'] < 18), 'Underage'] = 1
titanic[['Name','Age','Underage']].sample(10)

Numerical data can be [discretised](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html?highlight=qcut#discretization-and-quantiling).

In [None]:
# Discretise numerical data using quantiles
titanic['Fare_category'] = pd.qcut(titanic['Fare'], 5, labels=range(5))
titanic[['Name','Fare','Fare_category']].sample(10)
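
`qcut` chooses bin edges so that each bin holds roughly the same number of rows, whereas `pd.cut` splits the value range into bins of equal width. A sketch on deliberately skewed, made-up numbers shows the difference:

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 100])  # made-up, deliberately skewed

# quantile-based: the edge sits at the median
by_quantile = pd.qcut(values, 2, labels=["low", "high"])
# width-based: the edge sits halfway between min and max
by_width = pd.cut(values, 2, labels=["low", "high"])

print(pd.DataFrame({"value": values, "qcut": by_quantile, "cut": by_width}))
```

With skewed data such as fares, equal-width bins tend to put almost everything in the lowest bin, which is one reason to prefer quantiles here.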

In [None]:
# removing columns
X_train = titanic.drop('Survived', axis=1)
y_train = titanic['Survived']
X_train.info()

In [None]:
# drop doesn't modify the original dataset
titanic.info()

In [None]:
# create new datasets from existing ones and/or series
some_values = pd.concat([titanic[['Name', 'Underage']], y_train], axis=1).sample(10)
some_values

In [None]:
# convert to CSV
print(some_values.to_csv(index=False))

More pandas: 
* Official [tutorials](https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html), including this [10 minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html)
* Plenty of cheat sheets are just one Google query away (e.g. [basics](https://www.datacamp.com/community/blog/python-pandas-cheat-sheet), [combining datasets](https://pbs.twimg.com/media/C65MaMpVwAA3v0A.jpg) and so on).

## Numpy and vectorized computing

Almost any machine learning model requires some computational heavy lifting, usually involving linear algebra. Unfortunately, raw Python is slow at this because each operation is interpreted at runtime.

So instead, we'll use [**numpy**](https://docs.scipy.org/doc/numpy/user/quickstart.html), a library that lets you run fast computations with vectors, matrices and other tensors. The central object here is `numpy.ndarray`:

In [None]:
import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.array([5, 4, 3, 2, 1])
print("a = ", a)
print("b = ", b)

# math and boolean operations can be applied to each element of an array
print("a + 1 =", a + 1)
print("a * 2 =", a * 2)
print("(a == 2) =", a == 2)
# ... or corresponding elements of two (or more) arrays
print("a + b =", a + b)
print("a * b =", a * b)
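
To make the point about interpreted loops concrete, the sketch below computes the same sum of squares twice: once with a Python `for` loop and once with a single vectorized expression. The array size is arbitrary and the timings depend on your machine, so treat the printed numbers as indicative only.

```python
import time
import numpy as np

x = np.arange(100_000, dtype=np.float64)

# pure-Python loop: one interpreted multiply-and-add per element
start = time.perf_counter()
looped = 0.0
for v in x:
    looped += v * v
loop_seconds = time.perf_counter() - start

# vectorized: the loop runs inside compiled numpy code
start = time.perf_counter()
vectorized = np.sum(x * x)
vector_seconds = time.perf_counter() - start

print("loop:", loop_seconds, "s; vectorized:", vector_seconds, "s")
```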

In [None]:
# Your turn: compute half-products of a and b elements (halves of products)

# <YOUR CODE >

In [None]:
# compute elementwise quotient between squared a and (b plus 1)
# <YOUR CODE >

There's also a bunch of pre-implemented operations including logarithms, trigonometry, vector/matrix products and aggregations.

In [None]:
a = np.array([1, 2, 3, 4, 5])
b = np.array([5, 4, 3, 2, 1])
print("numpy.sum(a) = ", np.sum(a))
print("numpy.mean(a) = ", np.mean(a))
print("numpy.min(a) = ",  np.min(a))
print("numpy.argmin(b) = ", np.argmin(b))  # index of minimal element
# dot product. Also used for matrix/tensor multiplication
print("numpy.dot(a,b) = ", np.dot(a, b))
print("numpy.unique(['male','male','female','female','male']) = ", np.unique(
    ['male', 'male', 'female', 'female', 'male']))

# and tons of other stuff. see http://bit.ly/2u5q430 .
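
The cell above does not touch logarithms, trigonometry, or matrix products, so here is a brief sketch of those with small made-up arrays:

```python
import numpy as np

v = np.array([1.0, 2.0, 4.0])
logs = np.log2(v)               # elementwise base-2 logarithm
sine = np.sin(np.pi / 2)        # trigonometric functions accept scalars and arrays

M = np.array([[1, 2], [3, 4]])
w = np.array([1, 1])
product = M @ w                 # matrix-vector product, equivalent to np.dot(M, w)
column_sums = M.sum(axis=0)     # aggregate along axis 0: one sum per column

print(logs, sine, product, column_sums)
```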

The important part: all this functionality works with dataframes:

In [None]:
print("Max ticket price: ", np.max(titanic["Fare"]))
# print("\nThe guy who paid the most:\n", titanic.loc[np.argmax(titanic["Fare"])])
print("\nThe guy who paid the most:\n", titanic.loc[titanic["Fare"].idxmax()])

In [None]:
# your code: compute mean passenger age and the oldest guy on the ship

# <YOUR CODE >

In [None]:
print("Boolean operations")

print('a = ', a)
print('b = ', b)
print("a > 2", a > 2)
print("numpy.logical_not(a>2) = ", np.logical_not(a > 2))
print("numpy.logical_and(a>2,b>2) = ", np.logical_and(a > 2, b > 2))
print("numpy.logical_or(a>2,b<3) = ", np.logical_or(a > 2, b < 3))

print("\n shortcuts")
print("~(a > 2) = ", ~(a > 2))  # logical_not(a > 2)
print("(a > 2) & (b > 2) = ", (a > 2) & (b > 2))  # logical_and
print("(a > 2) | (b < 3) = ", (a > 2) | (b < 3))  # logical_or
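
The parentheses in the shortcut forms are required: `&` and `|` bind more tightly than comparisons in Python, so leaving them out changes how the expression is parsed. A small sketch of what goes wrong:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.array([5, 4, 3, 2, 1])

both = (a > 2) & (b > 2)    # parenthesized comparisons combine elementwise

try:
    a > 2 & b > 2           # parsed as the chained comparison a > (2 & b) > 2
    raised = False
except ValueError:
    raised = True           # chaining needs bool(array), which is ambiguous

print(both, raised)
```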

The final numpy feature we'll need is indexing: selecting elements from an array. 
Aside from python indexes and slices (e.g. a[1:4]), numpy also allows you to select several elements at once.

In [None]:
a = np.array([0, 1, 4, 9, 16, 25])
ix = np.array([1, 2, 5])
print("a = ", a)
print("Select by element index")
print("a[[1,2,5]] = ", a[ix])

print("\nSelect by boolean mask")
# select all elements in a that are greater than 5
print("a[a > 5] = ", a[a > 5])
print("(a % 2 == 0) =", a % 2 == 0)  # True for even, False for odd
print("a[a % 2 == 0] =", a[a % 2 == 0])  # select all elements in a that are even


# select male children
print("titanic[(titanic['Age'] < 18) & (titanic['Sex'] == 0)] = (below)")
titanic[(titanic['Age'] < 18) & (titanic['Sex'] == 0)]

### Your turn

Use numpy and pandas to answer a few questions about the data.

In [None]:
# who on average paid more for their ticket, men or women?

mean_fare_men = 0 # <YOUR CODE >
mean_fare_women = 0 # <YOUR CODE >

print('Average fare: men={}, women={}'.format(mean_fare_men, mean_fare_women))

In [None]:
# who is more likely to survive: a child (<18 yo) or an adult?

child_survival_rate = 0 # <YOUR CODE >
adult_survival_rate = 0 # <YOUR CODE >

print('Survival rate: underage={}, adults={}'.format(child_survival_rate, adult_survival_rate))