<a href="https://colab.research.google.com/github/shlear/MLDM-2022/blob/main/01-intro/DataHandling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

During the practical sessions of the course we are going to use [Python programming language](https://www.python.org) in the [Google Colab environment](https://colab.research.google.com). Alternatively you can download some other python distribution, e.g. [anaconda](https://www.anaconda.com/) and run jupyter locally (see the [docs](https://jupyter.readthedocs.io/en/latest/running.html) for more info).

In [None]:
!python --version

If you are new to Python, please consider reading through the following tutorial:
 - https://docs.python.org/3.7/tutorial/

In particular, the following parts of it should provide a more or less comprehensive introduction to the must-know basics:
   - https://docs.python.org/3.7/tutorial/introduction.html
   - https://docs.python.org/3.7/tutorial/controlflow.html
   - https://docs.python.org/3.7/tutorial/datastructures.html
   - https://docs.python.org/3.7/tutorial/modules.html
   - https://docs.python.org/3.7/tutorial/classes.html

Don't forget to follow [PEP-8](https://peps.python.org/pep-0008/). You may also check other [style guides](https://google.github.io/styleguide/pyguide.html).

# Welcome

An overview of basic features of the Google Colab environment can be found [here](https://colab.research.google.com/notebooks/basic_features_overview.ipynb).

# Tabular Playground Series

![Tabular Playground Series](https://storage.googleapis.com/kaggle-competitions/kaggle/33101/logos/header.png?t=2021-12-30-01-23-41)

This notebook's gonna teach you to use the basic data science stack for python: jupyter, numpy, pandas, matplotlib and sklearn.

We are going to use [Tabular Playground Series](https://www.kaggle.com/competitions?searchQuery=Tabular+Playground+Series) as the main data source for the experiments.

"These competitions are a great choice for people looking for something in between the Titanic Getting Started competition and the Featured competitions."

## Part I: Jupyter notebooks recap

This whole document you are looking at right now is a **jupyter notebook**. You can think of jupyter as a browser-friendly python development environment.

For each notebook there's a python interpreter running behind the scenes, also called a **kernel**. The notebook consists of **cells** - either *code* cells, or *text* cells. E.g. this text you're reading is in a text cell.

An example of a code cell can be found below. You can execute its code by placing the cursor in it and hitting `Shift + Enter`.

__please keep running all the code cells as you read__

In [None]:
print('Hello world')

Note that the same python session is used to run the code from different cells. So, for example, by defining a variable in one cell, you can re-use it in another:

In [None]:
some_number = 42

In [None]:
some_number**2

Jupyter allows you to run cells in an arbitrary order, which may make your code a bit messy and complicated to debug. In general it's a good practice to write your code such that it successfully runs from top to bottom in a clean environment. To reset your environment back to a clean state click `Runtime -> Restart runtime` (in regular jupyter: `Kernel -> Restart`).

**The most important feature** of jupyter notebooks for this course: 
* contextual help: the behaviour depends on whether you're running this in google colab or in regular jupyter.
* In colab the suggestions / documentation will appear automatically as you type.
* In regular jupyter if you're typing something, press `Tab` to see automatic suggestions / `Shift + Tab` for function documentation.

You can use [Markdown](https://jupyter.brynmawr.edu/services/public/dblank/Jupyter%20Notebook%20Users%20Manual.ipynb#4.-Using-Markdown-Cells-for-Writing) and [LaTeX](https://stackoverflow.com/questions/13208286/how-to-write-latex-in-ipython-notebook) through the cells as well.

*Note: here we'll assume you're using google colab*

In [None]:
# run this first
import math

In [None]:
# Place your cursor at the end of the unfinished line below and 
# type in '.' to see the contextual help and
# find a function that computes arctangent from two parameters (should
# have 2 in its name).
# Once you've chosen it, put an opening bracket character to
# see the docs.

math  # <--- type in a '.' symbol to see suggestions

## Part II: Numpy and vectorized computing

Almost any machine learning model requires some computational heavy lifting usually involving linear algebra problems. Unfortunately, raw python is terrible at this because each operation is interpreted at runtime. 

So instead, we'll use `numpy` - a library that lets you run blazing fast computation with vectors, matrices and other tensors. It's written in lower-level programming languages like C or Fortran and only uses python as an interface.

Quoting [documentation](https://numpy.org/devdocs/user/quickstart.html):
> NumPy’s main object is the homogeneous multidimensional array. It is a table of elements (usually numbers), all of the same type, indexed by a tuple of non-negative integers. In NumPy dimensions are called axes.




This object is called `numpy.ndarray` ("nd" standing for "N-dimensional"); it is usually created with the `numpy.array` function. It consists of two major components: the raw array data (from now on, referred to as the data buffer), and the information about the raw array data.

The data buffer is typically what people think of as arrays in C or Fortran, a contiguous (and fixed) block of memory containing fixed-sized data items. NumPy also contains a significant set of data that describes how to interpret the data in the data buffer. This extra information contains (among other things):

* The basic data element’s size in bytes.

* The start of the data within the data buffer (an offset relative to the beginning of the data buffer).

* The number of dimensions and the size of each dimension.

* The separation between elements for each dimension (the stride). This does not have to be a multiple of the element size.

* The byte order of the data (which may not be the native byte order).
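
Most of the items in the list above can be inspected directly on an array. The snippet below is a minimal sketch on a small array made up for illustration (the name `arr` is not part of the notebook); the dtype is fixed to `int64` so the byte counts do not depend on the platform default.

```python
import numpy as np

# A small 2 x 3 array of 64-bit integers.
arr = np.arange(6, dtype=np.int64).reshape(2, 3)

print("itemsize :", arr.itemsize)         # size of one element in bytes
print("ndim     :", arr.ndim)             # number of dimensions (axes)
print("shape    :", arr.shape)            # size of each dimension
print("strides  :", arr.strides)          # bytes to step along each axis
print("byteorder:", arr.dtype.byteorder)  # '=' means native byte order
```

Each row holds three 8-byte items, so moving to the next row skips 24 bytes while moving to the next column skips 8.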


In [None]:
import numpy as np

a = np.array([1,2,3,4,5])
b = np.array([5,4,3,2,1])
print("a = ", a)
print("b = ", b)

# math and boolean operations can be applied to each element of an array
print("a + 1 =", a + 1)
print("a * 2 =", a * 2)
print("a == 2", a == 2)
# ... or corresponding elements of two (or more) arrays
print("a + b =", a + b)
print("a * b =", a * b)

**Sharing your solutions can earn you additional points, which are added to your homework points.**

In [None]:
# Your turn: compute half-products of a and b elements (halves of products)
def half_product(a, b):
    raise NotImplementedError

In [None]:
half_product(a,b)

In [None]:
# compute elementwise quotient between squared a and (b plus 1)
def quotient(a,b):
    raise NotImplementedError

In [None]:
quotient(a, b)


There's a number of functions to create arrays of zeros, ones, ascending/descending numbers etc.:

In [None]:
np.zeros(shape=(3, 4))

In [None]:
np.ones(shape=(2, 5), dtype=bool)  # the `np.bool` alias is missing in some numpy versions

In [None]:
np.arange(3, 15, 2) # start, stop, step

In [None]:
np.linspace(0, 10, 11) # divide [0, 10] interval into 11 points

In [None]:
np.logspace(1, 10, 10, base=2, dtype=np.int64)

You can easily reshape arrays:

In [None]:
np.arange(24).reshape(2, 3, 4)

The `strides` of an array tell us how many bytes we have to skip in memory to move to the next position along a certain axis. 

In [None]:
d = np.arange(12).reshape(2, -1)
print(d)
print(d.strides)
print((d.shape[1] * d.dtype.itemsize, d.dtype.itemsize))
d = d.reshape(-1, 2)
print(d)
d.strides

Besides reshaping, you can also add dimensions of size 1:

In [None]:
print(np.arange(3)[:, np.newaxis])
print('---')
print(np.arange(3)[np.newaxis, :])

#### Or similarly:

# print(np.arange(3)[:, None])
# print('---')
# print(np.arange(3)[None, :])

Such dimensions are automatically [broadcast](https://numpy.org/doc/stable/user/basics.broadcasting.html) when doing mathematical operations:

In [None]:
print(np.arange(3)[:, np.newaxis] + np.zeros(shape=(3, 3), dtype=int))
print()
print(np.arange(3)[np.newaxis, :] + np.zeros(shape=(3, 3), dtype=int))

When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing (i.e. rightmost) dimensions and works its way left. Two dimensions are compatible when:

*   they are equal, or
*   one of them is 1

If these conditions are not met, a `ValueError: operands could not be broadcast together` exception is raised, indicating that the arrays have incompatible shapes. The size of the resulting array is the size that is not 1 along each axis of the inputs.

When either of the dimensions compared is one, the other is used. In other words, dimensions with size 1 are stretched or “copied” to match the other.
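
The rule can be checked on a couple of small shapes. This is a minimal sketch with arrays made up for illustration (`col`, `row` and `table` are not part of the notebook); `np.broadcast_shapes` is assumed to be available, which requires NumPy 1.20 or newer.

```python
import numpy as np

col = np.arange(3)[:, np.newaxis]   # shape (3, 1)
row = np.arange(4)[np.newaxis, :]   # shape (1, 4)

# Size-1 axes are stretched, so (3, 1) + (1, 4) gives (3, 4).
table = col + row
print(table.shape)

# np.broadcast_shapes applies the same rule to shapes only, without data.
print(np.broadcast_shapes((3, 1), (1, 4)))

# Trailing sizes 3 and 4 are neither equal nor 1, so this fails.
error_message = None
try:
    np.ones((2, 3)) + np.ones((2, 4))
except ValueError as err:
    error_message = str(err)
    print("ValueError:", error_message)
```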

![broadcasting_2[1].png](https://numpy.org/doc/stable/_images/broadcasting_2.png)



There are also a number of ways to stack arrays together:

In [None]:
matrix1 = np.arange(50).reshape(10, 5)
matrix2 = -np.arange(20).reshape(10, 2)

np.concatenate([matrix1, matrix2], axis=1)

In [None]:
A = matrix1[:,0]
B = matrix2[:,0]

print(A)
print('---')
print(B)
print('---')
print(np.stack([A, B], axis=1))
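
The difference between the two functions used above is easiest to see from the resulting shapes. A minimal sketch on two short vectors made up for illustration (`u` and `v` are not part of the notebook):

```python
import numpy as np

u = np.array([1, 2, 3])
v = np.array([10, 20, 30])

joined = np.concatenate([u, v])       # joins along the existing axis
as_rows = np.stack([u, v], axis=0)    # new leading axis: each input is a row
as_cols = np.stack([u, v], axis=1)    # new trailing axis: each input is a column

print(joined.shape, as_rows.shape, as_cols.shape)
```

`np.concatenate` keeps the number of dimensions, while `np.stack` always adds one.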



Any matrix can be transposed easily:

In [None]:
print(matrix2)
print('---')
print(matrix2.T)

Transposing doesn't create a copy of your data; it only changes the `strides`:

In [None]:
print('matrix2.shape =', matrix2.shape, ' strides =', matrix2.strides)
print('matrix2.T.shape =', matrix2.T.shape, ' strides =', matrix2.T.strides)
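
Because a transpose only swaps the strides, it shares its data buffer with the original array. This can be verified with `np.shares_memory`; the sketch below uses a small array made up for illustration (`m`, `t` and `independent` are not part of the notebook).

```python
import numpy as np

m = np.arange(6).reshape(2, 3)
t = m.T                      # a view with swapped strides, not a copy

print(np.shares_memory(m, t))

t[0, 1] = 100                # writes through to m[1, 0]
print(m)

independent = m.T.copy()     # an explicit copy owns its own buffer
print(np.shares_memory(m, independent))
```

If you need to modify the transposed array without touching the original, take a `.copy()` first.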

In [None]:
# Your turn: make a (7 x 5) matrix with e_ij = i
# (i - row number, j - column number)
#
# Avoid using loops.

<YOUR CODE>

### How fast is it?

Let's compare computation time for python and numpy
* Two arrays of 10^6 elements
 * first - from 0 to 999 999
 * second - from 99 to 1 000 098
 
* Computing:
 * elemwise sum
 * elemwise product
 * square root of first array
 * sum of all elements in the first array
 

In [None]:
%%time 
# ^-- this "magic" measures and prints cell computation time

# Option I: pure python
arr_1 = range(1000000)
arr_2 = range(99,1000099)


a_sum = []
a_prod = []
sqrt_a1 = []
for i in range(len(arr_1)):
    a_sum.append(arr_1[i]+arr_2[i])
    a_prod.append(arr_1[i]*arr_2[i])
    sqrt_a1.append(arr_1[i]**0.5)
    
arr_1_sum = sum(arr_1)


In [None]:
%%time

# Option II: start from python, convert to numpy
arr_1 = range(1000000)
arr_2 = range(99,1000099)

arr_1, arr_2 = np.array(arr_1) , np.array(arr_2)


a_sum = arr_1 + arr_2
a_prod = arr_1 * arr_2
sqrt_a1 = arr_1 ** .5
arr_1_sum = arr_1.sum()


In [None]:
%%time

# Option III: pure numpy
arr_1 = np.arange(1000000)
arr_2 = np.arange(99,1000099)

a_sum = arr_1 + arr_2
a_prod = arr_1 * arr_2
sqrt_a1 = arr_1 ** .5
arr_1_sum = arr_1.sum()


If you want more serious benchmarks, take a look at [this](http://brilliantlywrong.blogspot.ru/2015/01/benchmarks-of-speed-numpy-vs-all.html).


### Other numpy functions and features

There's also a bunch of pre-implemented operations including logarithms, trigonometry, vector/matrix products and aggregations.

In [None]:
a = np.array([1,2,3,4,5])
b = np.array([5,4,3,2,1])
print("numpy.sum(a) = ", np.sum(a))
print("numpy.mean(a) = ", np.mean(a))
print("numpy.min(a) = ",  np.min(a))
print("numpy.argmin(b) = ", np.argmin(b))  # index of minimal element
print("numpy.dot(a,b) = ", np.dot(a, b))      # dot product. Also used for matrix/tensor multiplication
print("numpy.unique(['male','male','female','female','male']) = ", np.unique(['male','male','female','female','male']))

# and tons of other stuff. see http://bit.ly/2u5q430 .

In [None]:
# most of these functions are also implemented as methods of numpy arrays, e.g.:
print('a.min() =', a.min())
print('a.mean() =', a.mean())

In [None]:
print("Boolean operations")

print('a = ', a)
print('b = ', b)
print("a > 2", a > 2)
print("numpy.logical_not(a>2) = ", np.logical_not(a>2))
print("numpy.logical_and(a>2,b>2) = ", np.logical_and(a > 2,b > 2))
print("numpy.logical_or(a>2,b<3) = ", np.logical_or(a > 2, b < 3))

print("\n shortcuts")
print("~(a > 2) = ", ~(a > 2))                    #logical_not(a > 2)
print("(a > 2) & (b > 2) = ", (a > 2) & (b > 2))  #logical_and
print("(a > 2) | (b < 3) = ", (a > 2) | (b < 3))  #logical_or

Another numpy feature we'll need is indexing: selecting elements from an array. 
Aside from python indexes and slices (e.g. a[1:4]), numpy also allows you to select several elements at once.

In [None]:
a = np.arange(24).reshape(4, 6)
print(a)
print('---')
print(a[1:3,0:6:2])

In [None]:
a = np.array([0, 1, 4, 9, 16, 25])
ix = np.array([1,2,5])
print("a = ", a)
print("Select by element index")
print("a[[1,2,5]] = ", a[ix])

print("\nSelect by boolean mask")
print("a[a > 5] = ", a[a > 5])     # select all elements in a that are greater than 5
print("(a % 2 == 0) =", a % 2 == 0) # True for even, False for odd
print("a[a % 2 == 0] =", a[a % 2 == 0]) # select all elements in a that are even

## Part III: Loading data with Pandas

Pandas is a library that helps you load the data, prepare it and perform some lightweight analysis. The god object here is the `pandas.DataFrame` - a 2d table with batteries included (it actually runs numpy under the hood).

Let's download the data and perform a small **E**xploratory **D**ata **A**nalysis.

The most convenient way to do it, which also lets you avoid using your local storage, looks as follows:

1. Go to your Kaggle account, scroll to the API section and click on "Create New API Token". This will download a `kaggle.json` file to your **local** machine.

2. Upload it to the Google Colab env.

3. Move it using the commands below.

In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
!kaggle competitions list -s tabular-playground-series

Download the data using the name of the competition you want to work with. Don't forget to accept the rules of the competition. 

In [None]:
!kaggle competitions download -c 'tabular-playground-series-aug-2022'

In case you don't have Kaggle account or you just want to avoid the steps mentioned above, you can just download it from the course repository. 

In [None]:
# !wget https://raw.githubusercontent.com/HSE-LAMBDA/MLDM-2022/main/01-intro/tabular-playground-series-aug-2022.zip

In [None]:
!unzip /content/tabular-playground-series-aug-2022.zip -d data

Now let's use Pandas to read the data.

In [None]:
import pandas as pd
data = pd.read_csv('/content/data/train.csv',
                    index_col='id') # this yields a pandas.DataFrame
data.index += 1

In [None]:
# Take a look at the data

data.head() # selects top 5 lines

In [None]:
pd.options.display.max_rows = 10
data

In [None]:
data.sample(5)

#### About the data
This data represents the results of a large product testing study. For each `product_code` you are given a number of product `attributes` (fixed for the code) as well as a number of `measurement` values for each individual product, representing various lab testing methods. Each product is used in a simulated real-world environment experiment, and absorbs a certain amount of fluid (`loading`) to see whether or not it fails.

Your task is to use the data to predict individual product failures of new codes with their individual lab test results. The training data includes the target `failure` and you need to predict the likelihood each test id will experience a failure.

In [None]:
# table dimensions
print("len(data) = ", len(data))
print("data.shape = ", data.shape)

In [None]:
# select a single row
print(data.loc[4])

In [None]:
# select a single column.
loadings = data["loading"] # alternatively: data.loading
print(loadings.loc[:10])

In [None]:
# select several columns and rows at once
data.loc[5:10, ["loading", "failure"]]    # alternatively: data[["loading","failure"]].loc[5:10]

### `loc` vs `iloc`

There are two ways of indexing the rows in pandas:
 *   by index column values (`id` in our case) – use `data.loc` for that
 *   by positional index - use `data.iloc` for that

Note that the index column starts from 1, so positional index 0 will correspond to index column value 1, positional 1 to index column value 2, and so on:

In [None]:
print(data.index)
print('------')
print("data.iloc[0]:")
print(data.iloc[0])
print('------')
print("data.loc[1]:")
print(data.loc[1])

Also note that when indexing with `.loc` both slice ends are included:

In [None]:
data.loc[2:3]

while with `.iloc` the end is excluded:

In [None]:
data.iloc[1:2]
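
The same contrast can be reproduced without the downloaded data. The snippet below is a minimal sketch on a hypothetical toy table (`toy` and its values are made up) whose index labels start at 1, like `id` above.

```python
import pandas as pd

# Hypothetical toy table with index labels 1..4.
toy = pd.DataFrame({"loading": [80.1, 95.5, 120.0, 101.3]},
                   index=[1, 2, 3, 4])

by_label = toy.loc[2:3]        # labels 2 and 3, both ends included
by_position = toy.iloc[1:2]    # position 1 only, end excluded

print(by_label)
print(by_position)
```

Here `toy.loc[2:3]` returns two rows, while `toy.iloc[1:2]` returns only the row labelled 2.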

More complicated indexing (similar to boolean indexing in numpy):

In [None]:
data.loc[(data['loading'] < 90) & (data['product_code'] == np.random.choice(data.product_code.unique()))]

In [None]:
data.query('2.1 * measurement_9 < measurement_10 and loading < 111')

### Your turn:


In [None]:
# select studies number 13 and 666 in a single line - did they fail?

<YOUR CODE>

In [None]:
# compute the overall fail-rate (what fraction of studies failed)
# do we face a balanced train dataset?

<YOUR CODE>

```

```

```

```

```

```

```

```

```

```

```

```

```

```



Pandas also has some basic data analysis tools. For one, you can quickly display statistical aggregates for each column using `.describe()`

In [None]:
data.describe()

Some columns contain __NaN__ values - this means that there is no data there. For example, study `#26565` has unknown `measurement_16`. To simplify the future data analysis, we'll replace NaN values by using pandas `fillna` function.

_Note: we do this so easily because it's a tutorial. In general, you should think twice before you modify data like this._

In [None]:
data['measurement_9'] = data['measurement_9'].fillna(value=data['measurement_9'].mean())
data['measurement_16'] = data['measurement_16'].fillna(value=data['measurement_16'].mean())

In [None]:
data.iloc[26565]
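
What mean-filling does is easier to follow on a tiny example. The sketch below uses a hypothetical toy column (`toy`, `measurement` and `measurement_was_missing` are made-up names). Recording which rows were missing before filling is one possible precaution, shown here only as an assumption about what you might want; the notebook itself does not do it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing values.
toy = pd.DataFrame({"measurement": [1.0, np.nan, 3.0, np.nan, 5.0]})

# Record which rows were missing before filling, so the information is kept.
toy["measurement_was_missing"] = toy["measurement"].isna()

# Series.mean() skips NaN by default, so the mean is (1 + 3 + 5) / 3 = 3.0.
toy["measurement"] = toy["measurement"].fillna(toy["measurement"].mean())

print(toy)
```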

### Pandas + numpy

The important part: as pandas uses numpy under the hood, most of numpy functionality works with dataframes, as you can get their numpy representation with `.values` (most numpy functions will even work on pure pandas objects):

In [None]:
# calling np.max on a pure pandas column:
column_name = 'measurement_17'
print("Max {}: ".format(column_name), np.max(data[column_name]))

# calling np.argmax on a numpy representation of a pandas column
# to get its positional index:
print("\nThe study with the max " + column_name + ":\n",
      data.iloc[
          np.argmax(data[column_name].values)
      ])

In [None]:
# numpy works only with positional index:
column_name = 'measurement_16'
print(data[column_name].values.argmax())
#     ^^^^^^^^^^^^^^^^^^^
#     this part returns a numpy array, argmax of which we are calculating


# in pandas you can ask for the index (i.e. value of the index column)
# of the maximal element like this:
print(data[column_name].idxmax())

### Your turn

Use numpy and pandas to answer a few questions about data

In [None]:
# your code: compute mean loading and find max one.
<YOUR CODE>

In [None]:
# which product code is more likely to fail?
<YOUR CODE>

More pandas: 
* Official [tutorials](https://pandas.pydata.org/pandas-docs/stable/tutorials.html), including this [10 minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/10min.html#min)
* A bunch of cheat sheets are just one google query away (e.g. [basics](http://datacamp-community-prod.s3.amazonaws.com/dbed353d-2757-4617-8206-8767ab379ab3), [combining datasets](https://pbs.twimg.com/media/C65MaMpVwAA3v0A.jpg) and so on).

## Part IV: plots and matplotlib

Using python to visualize the data is covered by yet another library: `matplotlib`.

Just like python itself, matplotlib has an awesome tendency of keeping simple things simple while still allowing you to write complicated stuff with convenience (e.g. super-detailed plots or custom animations).

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline  
# ^-- this "magic" selects specific matplotlib backend suitable for
# jupyter notebooks. For more info see:
# https://ipython.readthedocs.io/en/stable/interactive/plotting.html#id1
# (actually it's the default in google colab)

#scatter-plot
x = np.arange(5)
print("x =", x)
print("x**2 =", x**2)
print("plotting x**2 vs x:")
plt.scatter(x, x**2)
plt.show()  # show the first plot to begin drawing the next one

plt.plot(x, x**2);

In [None]:
# histogram - showing data density
float_cols = [c for c in data.columns if data[c].dtype == float]

_, axs = plt.subplots(2, 4, figsize=(10,5))

for f, ax in zip(float_cols[:8], axs.ravel()):
    bins = np.linspace(data[f].min(), data[f].max(), 50)  # Series.min/max skip NaN
    ax.hist(data[f].dropna(), bins=bins, density=True)
    ax.set_xlabel(f)

plt.tight_layout(w_pad=1)
plt.suptitle('Distributions of the continuous features', fontsize=15, y=1.12)
plt.show()

In [None]:
# or you can use inbuilt methods and combine it with pyplot
data.failure.hist()
plt.title('Failure hist')
plt.show()

In [None]:
# .plot() method allows you to plot in different styles
data.product_code.value_counts().plot(kind='pie', autopct="%1.1f%%",shadow=True, 
        startangle=45, explode=[0.065] * data.product_code.nunique());

In [None]:
import seaborn as sns


In [None]:
_, axs = plt.subplots(2, 2, figsize=(19, 19))
for product, ax in zip(np.unique(data.product_code)[:4], axs.ravel()):
    corr = data[float_cols + ['measurement_0', 'measurement_1', 'measurement_2']][data.product_code == product].corr()
    mask = np.triu(np.ones_like(corr, dtype=bool))
    sns.heatmap(corr, mask=mask, fmt='0.2f', 
                annot=True, cmap='Purples',  ax=ax, cbar=False)
    ax.set_title(product)
plt.tight_layout(w_pad=0.8)
plt.show()

In [None]:
# plot a barplot of number of missing values in columns
<YOUR CODE>

# hint: use data.isnull() method

* Extended [tutorial](https://matplotlib.org/2.0.2/users/pyplot_tutorial.html)
* Other libraries for more sophisticated stuff: [Plotly](https://plot.ly/python/), and [Bokeh](https://bokeh.pydata.org/en/latest/)

## Part V (final): machine learning with scikit-learn

<img src='https://i.redd.it/k13eojlo31i31.png' width=500px>

Scikit-learn is the tool for simple machine learning pipelines. 

It's a single library that unites a whole bunch of models under the common interface:
* __Create:__ `model = sklearn.whatever.ModelNameHere(parameters_if_any)`
* __Train:__ `model.fit(X,y)`
* __Predict:__ `model.predict(X_test)`

It also contains utilities for feature extraction, quality estimation or cross-validation.
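
The create / train / predict pattern can be sketched end to end on synthetic data. Everything below is an illustration under assumptions: the two-blob dataset and the variable names are made up, and `train_test_split` is just one of several ways to hold out a test set.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: two well-separated 2D blobs, 100 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(5.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Hold out 10% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)

model = KNeighborsClassifier(n_neighbors=5)   # create
model.fit(X_train, y_train)                   # train
predictions = model.predict(X_test)           # predict

print("Test accuracy:", accuracy_score(y_test, predictions))
```

Other scikit-learn classifiers expose the same `fit` and `predict` methods, so swapping the model only changes the line that creates it.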

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

x = data[float_cols].copy()
y = data["failure"]

model = KNeighborsClassifier(n_neighbors=5)

# split the data into train(90%) and test(10%)
train_ids = ...
test_ids = ...

# fit the model
model.fit(...,... )

# make the prediction
test_predictions = model.predict(...)
print("Test accuracy:", accuracy_score(..., test_predictions))


Try tuning `n_neighbors` and playing around with the features to boost accuracy.

* Sklearn [tutorials](http://scikit-learn.org/stable/tutorial/index.html)
* Sklearn [examples](http://scikit-learn.org/stable/auto_examples/index.html)
* Sklearn [cheat sheet](http://scikit-learn.org/stable/_static/ml_map.png)


## Bonus part

### Pandas: adding new columns

In [None]:
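#!wget https://raw.githubusercontent.com/HSE-LAMBDA/MLDM-2022/main/01-intro/train.csv #use Titanic Data for examples below
data = pd.read_csv('train.csv')  # load the Titanic data; assumes train.csv is in the working directory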

To define a new column in a dataframe simply assign to it (if such a column exists it will get overwritten):

In [None]:
data['CabinUnknown'] = data.Cabin.isna()
data.head()

Be sure to use the approach with `['ColumnName']` rather than `.ColumnName`, otherwise it won't work:

In [None]:
data.this_will_not_work = data.Age**2
data.head()
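
Why the attribute form fails can be seen on a small table: the assignment sets a plain Python attribute on the object and leaves the columns untouched. A minimal sketch, with a hypothetical toy frame and made-up names (`age_squared`, `AgeSquared`):

```python
import warnings
import pandas as pd

toy = pd.DataFrame({"Age": [22.0, 38.0, 26.0]})

with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # pandas warns about the next line
    toy.age_squared = toy.Age ** 2    # sets a Python attribute, not a column

toy["AgeSquared"] = toy["Age"] ** 2   # creates a real column

print(list(toy.columns))
```

Only `AgeSquared` shows up among the columns; reading an existing column as `toy.Age` works, but creating one that way does not.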

### Pandas: one-hot encoding

In [None]:
pd.get_dummies(data.Embarked, prefix='Embarked').head()
# added .head() for a more compact output

### Pandas: merging tables

In [None]:
data_extended = pd.concat([
                      data,
                      pd.get_dummies(data.Embarked, prefix='Embarked')
                    ], axis=1)
data_extended.head()

### Pandas: groupby

This function provides a neat way to calculate some statistics for groups of entries with some common feature value.

In [None]:
g = data.groupby('Embarked')
# Now `g` is an iterable of dataframes split based on the values
# in the 'Embarked' column:

for embarked, group in g:
  print(embarked, type(group), group.shape)

In [None]:
# You can calculate things on the groups simultaneously:

g.mean(numeric_only=True)  # text columns cannot be averaged, so restrict to numeric ones

In [None]:
g.count() # this calculates the number of valid entries (excluding nans)

In [None]:
# You can also access individual columns:
g.Fare.max()
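
The same operations can be tried on a table small enough to check by hand. The sketch below uses a hypothetical toy frame that borrows the Titanic column names; the values are made up.

```python
import pandas as pd

# Hypothetical toy table with the same column names as the Titanic data.
toy = pd.DataFrame({
    "Embarked": ["S", "C", "S", "Q", "C", "S"],
    "Fare": [7.25, 71.28, 8.05, 8.46, 30.07, 53.10],
    "Survived": [0, 1, 0, 0, 1, 1],
})

# One row per port: the mean of each numeric column within the group.
print(toy.groupby("Embarked").mean(numeric_only=True))

# Several aggregates of a single column at once.
fare_stats = toy.groupby("Embarked")["Fare"].agg(["min", "max", "count"])
print(fare_stats)
```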

### Pandas: cut and qcut

These functions let us split data into bins: `cut` makes equal-width splits, while `qcut` makes quantile-based splits. They both return a column with the bin to which each entry belongs:

In [None]:
pd.cut(data.Age, 3).head() # '.head()' added for a more compact output

In [None]:
pd.qcut(data.Age, 3).head() # '.head()' added for a more compact output
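
The difference between the two shows up most clearly on skewed data. A minimal sketch on a hypothetical sample (`values` is made up) where one value is much larger than the rest:

```python
import pandas as pd

# Hypothetical skewed sample: most values are small, one is large.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

equal_width = pd.cut(values, 3)    # 3 bins of equal width over the value range
equal_count = pd.qcut(values, 3)   # 3 bins holding about the same number of values

# Count how many values fall into each bin, in bin order.
width_counts = equal_width.value_counts().sort_index()
count_counts = equal_count.value_counts().sort_index()

print(width_counts)
print(count_counts)
```

With `cut` almost everything lands in the first bin and the middle bin stays empty, while `qcut` puts three values in each bin.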

### Your turn

Use `cut` and `groupby` to calculate survival rate for 3 age categories.

**Hint:** you need to add the result of `cut` as a new column

In [None]:
<YOUR CODE HERE>

### Pandas: combining the tricks (survival vs ticket fare)

In [None]:
from matplotlib.ticker import ScalarFormatter

data['qFare'] = pd.qcut(data.Fare, 20)

sur_vs_price = data.groupby('qFare').Survived.mean()
sur_vs_price_e = data.groupby('qFare').Survived.std() \
                        / data.groupby('qFare').Survived.count()**0.5

fig = plt.figure(figsize=(16, 9))
plt.errorbar(x=sur_vs_price.index.categories.mid,
             y=sur_vs_price.values,
             yerr=sur_vs_price_e.values,
             xerr=(
                 sur_vs_price.index.categories.right - 
                 sur_vs_price.index.categories.left
               ) / 2,
             fmt='o')
plt.gca().set_xscale('log')
plt.gca().xaxis.set_major_formatter(ScalarFormatter())
plt.gca().set_xticks(
              list(range(3, 10)) +
              list(range(10, 100, 10)) +
              list(range(100, 700, 100))
            )

plt.xlabel('Fare')
plt.ylabel('Survival probability');