### Course 

1. Python Core
    + The Course Overview
    + Dates and Times
    + List Comprehensions
    + Python Core Concepts and Data Types
    + Understanding Iterables
    + Accessing Raw Data
2. NumPy for Array Computation
    + Creating NumPy Arrays
    + Reshaping, Indexing, and Slicing
    + Basic Stats and Linear Algebra 
3. Pandas for Data Frames
    + Essential Operations with Data Frames
    + Getting Started with Pandas
    + Summary Statistics from a Data Frame
    + Data Aggregation over a Data Frame
4. Exercise - Titanic Survivor Analysis
    + Performing Supervised Learning with Scikit-Learn
    + Predicting Titanic survival - A Supervised Learning Problem
    + Exercise - Titanic Survivor Analysis

### Python Core Concepts and Data Types

This section covers core Python concepts and syntax, starting with variables and the built-in data types.

In [None]:
my_name = 'Marco'
x = 1
y = 10


print("Hello my name is {}".format(my_name))

In [None]:
type(my_name)

In [None]:
type(x)

In [None]:
isinstance(my_name, str)  # (obj_to_check, given_type)

In [None]:
isinstance(my_name, int)

In [None]:
isinstance(my_name, object)

#### Boolean

In [None]:
a = True
b = False


type(a)

#### Comparisons

In [None]:
a == b

In [None]:
a != b

In [None]:
a == 1  # a is True

In [None]:
b == 0  # b is False

In [None]:
1 < 2  # also, 2 > 1

In [None]:
1 <= 2  # also, 2 >= 1

In [None]:
1 < 2 and 2 < 3

In [None]:
1 < 2 < 3

#### Numeric

In [None]:
x = 2  # int
y = 5.0  # float
z = 2 + 1j  # complex
z_alt = complex(2, 1)  # complex

In [None]:
type(x)

In [None]:
type(y)

In [None]:
type(z)

In [None]:
x + y

In [None]:
x * y

#### Strings

In [None]:
language1 = 'Python'
language2 = "Python"
language3 = '''Python'''  # multi-line
language4 = """Python"""  # multi-line

In [None]:
language1 == language2 == language3 == language4

In [None]:
py_version = "3.6"
language = "Python {}".format(py_version)
language

In [None]:
type(language)

In [None]:
language[0]  # Python is zero-indexed

In [None]:
language.lower()

In [None]:
language.upper()

In [None]:
language.replace('3.6', 'X.Y')  # (old, new)

In [None]:
language
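
The last two cells hint that strings are immutable: `replace()` returns a new string and leaves `language` as it was. A minimal sketch of that behaviour follows; the name `updated` is only illustrative.

```python
language = "Python 3.6"

# replace() builds and returns a new string
updated = language.replace("3.6", "X.Y")
print(updated)
print(language)  # the original is unchanged

# Item assignment is not supported on strings
try:
    language[0] = "J"
except TypeError as err:
    print("TypeError:", err)
```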

#### Sequences: lists and tuples

In [None]:
my_list = ['red', 'green', 'blue']
my_tuple = ('Marco', 'UK', True)

In [None]:
'red' in my_list

In [None]:
'John' in my_tuple

In [None]:
for item in my_list:
    print(item)

#### Mappings: dictionaries

In [None]:
item = {
    'id': 123,
    'price': 19.90,
    'description': 'Book',
    'in_stock': True
}

In [None]:
item

In [None]:
type(item)

In [None]:
item['price']

In [None]:
item['price'] = 24.9
item['price']

In [None]:
del item['in_stock']
item

In [None]:
item['available'] = True
item

In [None]:
# item['does_not_exist']  # uncomment and run: this raises a KeyError
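
When a key may be missing, there are a few ways to avoid an unhandled `KeyError`. The sketch below uses a cut-down copy of `item`; which approach fits best depends on whether a missing key is expected or a genuine error.

```python
item = {'id': 123, 'price': 24.9, 'description': 'Book'}

# A membership test on a dictionary checks its keys
print('price' in item)
print('does_not_exist' in item)

# get() returns None (or a chosen default) instead of raising KeyError
print(item.get('does_not_exist'))
print(item.get('does_not_exist', 'n/a'))

# Catch the exception when a missing key is a genuine error case
try:
    item['does_not_exist']
except KeyError as err:
    print("Missing key:", err)
```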

#### Custom Functions

In [None]:
def add_them(a, b):
    return a + b

In [None]:
add_them(10, 5)

In [None]:
add_them('foo', 'bar')

In [None]:
# add_them('hello', 3)  # this raises a TypeError: str and int cannot be added

#### Control Flow

Let's have a look at some control flow statements

#### If/else statements

The `if` statement is used for conditional execution

In [None]:
a = 10
b = 10
if a > 0 and b > 0:
    print("Add them! {}".format(add_them(a, b)))
elif a == 0:
    print("A is zero")
else:
    print("Nothing to see")

#### While statements
You can use `while` to loop while a condition is true

In [None]:
x = 5
while x > 0:
    print(x)
    x = x - 1

#### For
You can use `for` to loop through an iterable

In [None]:
for x in [1, 2, 3]:
    print(x)


for x in (1, 2, 3):
    print(x)


my_string = "HELLO"
for letter in my_string:
    print(letter)

#### Exercise
Given a list of products, print out the name of all the products with a price higher than 10

In [None]:
products = [
    {'name': 'orange', 'price': 20},
    {'name': 'apple', 'price': 8},
    {'name': 'banana', 'price': 10}
]

#### Solution:

In [None]:
for item in products:
    if item["price"] > 10:
        print(item["name"])

### Understanding Iterables
Iterables: anything you can iterate on
For example:
- sequences (tuples, lists)
- mappings (dictionaries)
- generators

#### Sequences
Sequences are iterables with random access

In [None]:
record = ('Marco', 'UK', True, 123)  # tuple
colours = ['blue', 'red', 'green']  # list

In [None]:
for item in record:
    print(item)

In [None]:
for item in colours:
    print(item)

In [None]:
['blue', 'blue', 'red'] == ['red', 'blue', 'blue']  # order matters!

In [None]:
record[0]  # Python is zero-indexed

In [None]:
colours[2]  # get the third item (index 2)

In [None]:
colours[-1]  # get the last item

##### Mutable vs Immutable data
+ Tuples are immutable: once created they cannot be modified
+ Lists are mutable

In [None]:
# record[0] = 'Jane'  # tuples are immutable!

In [None]:
colours[0] = 'yellow'  # lists are mutable
colours

In [None]:
colours.append('orange')
colours

In [None]:
colours.extend(['black', 'white'])
colours

In [None]:
colours.extend('purple')  # careful: a string is iterable, so each character is added
colours
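
The difference between `append()` and `extend()` is easy to get wrong, so here is a small side-by-side sketch. The list is reset before each call so the results can be compared.

```python
colours = ['blue', 'red']
colours.append(['black', 'white'])  # adds the list itself as one item
print(colours)

colours = ['blue', 'red']
colours.extend(['black', 'white'])  # adds each item of the list
print(colours)

colours = ['blue', 'red']
colours.extend('ab')  # a string is iterable: adds 'a' and 'b'
print(colours)

colours = ['blue', 'red']
colours.append('purple')  # to add a whole string, use append()
print(colours)
```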

#### Slicing Lists

In [None]:
colours = ['blue', 'red', 'green', 'black', 'white']

In [None]:
colours[0:2]

In [None]:
colours[3:4]

In [None]:
colours[:3]

In [None]:
colours[:-1]
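
Slices follow the pattern `[start:stop:step]`, where `stop` is excluded and any part can be omitted. A few more cases, as a sketch (`first_two` is an illustrative name):

```python
colours = ['blue', 'red', 'green', 'black', 'white']

print(colours[1:])     # from index 1 to the end
print(colours[-2:])    # the last two items
print(colours[::2])    # every second item
print(colours[::-1])   # a reversed copy

# Slicing returns a new list, so the original is untouched
first_two = colours[0:2]
first_two[0] = 'yellow'
print(colours[0])
```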

#### Generators
Generators are a convenient way to build iterators.

Important to remember:

- Laziness: values are generated on demand
- Laziness: values are not all held in memory
- You can iterate only once
- You cannot access randomly

In [None]:
# A simplified, generator-based imitation of range(start=0, stop=n);
# the built-in range() also supports indexing and repeated iteration
def my_range(n):
    num = 0
    while num < n:
        yield num
        num += 1

In [None]:
x = my_range(5)
x

In [None]:
for item in x:
    print(item)

In [None]:
for item in x:
    print(item)
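
The second loop above prints nothing because the generator is already exhausted. The sketch below makes this explicit and shows `next()`, which is what a `for` loop calls behind the scenes. `my_range` is redefined so the example runs on its own.

```python
def my_range(n):
    num = 0
    while num < n:
        yield num
        num += 1

gen = my_range(3)
first_pass = list(gen)
second_pass = list(gen)   # the generator is already exhausted
print(first_pass, second_pass)

# next() pulls one value at a time; StopIteration signals the end
gen = my_range(2)
print(next(gen), next(gen))
try:
    next(gen)
except StopIteration:
    print("no more values")

# Calling the function again creates a fresh generator
print(list(my_range(3)))
```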

In [None]:
import sys

In [None]:
sys.getsizeof(my_range(5))

In [None]:
sys.getsizeof(list(my_range(5)))

In [None]:
sys.getsizeof(my_range(1000))

In [None]:
sys.getsizeof(list(my_range(1000)))

### List Comprehensions
- List comprehension
- More on generators
- `map()` / `reduce()` / `filter()`

#### List comprehensions
A comprehension is a construct that allows sequences to be built from other iterables

In [None]:
squares = []
for x in range(10):
    squares.append(x*x)
squares

In [None]:
squares = [x*x for x in range(10)]
squares

In [None]:
combos = []
for x in [1, 2, 3]:
    for y in [1, 2, 3]:
        if x != y:
            combos.append( (x, y) )
            
combos

In [None]:
combos = [(x, y) for x in [1, 2, 3] for y in [1, 2, 3] if x != y]
combos

In [None]:
combos = [(x, y) for x in [1, 2, 3]
                 for y in [1, 2, 3]
                 if x != y]
combos

In [None]:
combos = [(x, y) for x in [1, 2, 3]
                 for y in [1, 2, 3]
                 if x == y]
combos
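
As one more illustration, the earlier products exercise can be written as a comprehension. This is only a sketch; `expensive` and `labels` are illustrative names.

```python
products = [
    {'name': 'orange', 'price': 20},
    {'name': 'apple', 'price': 8},
    {'name': 'banana', 'price': 10}
]

# Filter and transform in one expression
expensive = [p['name'] for p in products if p['price'] > 10]
print(expensive)

# A conditional expression picks a value for every item instead of filtering
labels = ['high' if p['price'] > 10 else 'low' for p in products]
print(labels)
```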

#### Dictionary comprehension

In [None]:
words = "The quick brown fox jumped over the lazy dog".split()
words

In [None]:
word_len = {}
for w in words:
    word_len[w] = len(w)
    
word_len

In [None]:
word_len = {w: len(w) for w in words}
word_len

#### Back to generators
We can define a generator using the comprehension syntax

In [None]:
squares = (x*x for x in range(10))
squares

In [None]:
for item in squares:
    print(item)

#### map, reduce and filter

In [None]:
# map() example
def f(x):
    return x*x
numbers = range(10)
# squares = (f(x) for x in numbers)
squares = map(f, numbers)
squares

In [None]:
list(squares)

In [None]:
# map() over multiple sequences
def add_them(a, b):
    return a + b
seq_a = [2, 4, 6]
seq_b = [1, 2, 3]
results = map(add_them, seq_a, seq_b)
list(results)

In [None]:
# reduce() example
from functools import reduce
seq = [1, 2, 3, 4]
results = reduce(add_them, seq)
results

In [None]:
# filter() example
def is_even(x):
    return x % 2 == 0
seq = range(10)
even_numbers = filter(is_even, seq)
list(even_numbers)

In [None]:
seq = range(10)
even_numbers = (x for x in seq if is_even(x))
list(even_numbers)
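
`reduce()` is the least familiar of the three, so here is a short sketch of how it folds a sequence and what the optional starting value does. The lambdas stand in for `add_them`.

```python
from functools import reduce

seq = [1, 2, 3, 4]

# reduce() folds the sequence left to right: ((1 + 2) + 3) + 4
total = reduce(lambda a, b: a + b, seq)

# An optional third argument is the starting value,
# which is also the result for an empty sequence
total_from_100 = reduce(lambda a, b: a + b, seq, 100)
empty_total = reduce(lambda a, b: a + b, [], 0)

print(total, total_from_100, empty_total)
print(total == sum(seq))  # for plain addition the built-in sum() is simpler
```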

### Dates and Times
Dates and times can be tricky because there are many ways to represent them.

#### datetime objects

In [None]:
"2017-02-01" == "01/02/2017"

In [None]:
from datetime import datetime
some_day = datetime(2017, 2, 1, 16, 30, 0)
some_day

In [None]:
some_day.strftime("%Y-%m-%d %H:%M:%S")

In [None]:
d1 = datetime.strptime("2017-02-01", "%Y-%m-%d")
d2 = datetime.strptime("01/02/2017", "%d/%m/%Y")
d1 == d2

Recap:
    
- `strftime()` for object-to-string
- `strptime()` for string-to-object
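
A round trip makes the recap concrete. The format string and sample text below are made up for illustration.

```python
from datetime import datetime

fmt = "%d/%m/%Y %H:%M"
text = "01/02/2017 16:30"

parsed = datetime.strptime(text, fmt)   # string -> datetime
print(parsed)

formatted = parsed.strftime(fmt)        # datetime -> string
print(formatted == text)

# A string that does not match the format raises ValueError
try:
    datetime.strptime("2017-02-01", fmt)
except ValueError as err:
    print("ValueError:", err)
```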

In [None]:
# Timezone name
some_day.strftime("%Z")

In [None]:
# UTC offset
some_day.strftime("%z")

In [None]:
some_day.year, some_day.month, some_day.day

In [None]:
some_day.hour, some_day.minute, some_day.second, some_day.microsecond

In [None]:
some_day.tzinfo

#### More on time zones
- Naive datetime objects don't consider time zones
- Aware datetime objects do consider time zones

In [None]:
from datetime import tzinfo, timedelta
class UTC0(tzinfo):
    def utcoffset(self, dt):
        return timedelta(hours=0)
    def dst(self, dt):
        return timedelta(0)
    def tzname(self,dt):
        return "Europe/London"
    
class GMT1(tzinfo):
    def utcoffset(self, dt):
        return timedelta(hours=1)
    def dst(self, dt):
        return timedelta(0)
    def tzname(self,dt):
        return "Europe/Amsterdam"

In [None]:
some_day = datetime(2017, 2, 1, 16, 30, 0, tzinfo=GMT1())
another_day = datetime(2017, 2, 1, 15, 30, 0, tzinfo=UTC0())
some_day == another_day

In [None]:
some_day.isoformat()
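
For fixed offsets, the standard library already provides `datetime.timezone`, so a custom `tzinfo` subclass is not strictly needed. A sketch of the same comparison follows; note that a fixed offset does not model daylight saving time.

```python
from datetime import datetime, timedelta, timezone

utc_plus_1 = timezone(timedelta(hours=1))

some_day = datetime(2017, 2, 1, 16, 30, 0, tzinfo=utc_plus_1)
another_day = datetime(2017, 2, 1, 15, 30, 0, tzinfo=timezone.utc)

print(some_day == another_day)      # same instant, different offsets
print(some_day.isoformat())
print(some_day.astimezone(timezone.utc).isoformat())
```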

#### Operations on Dates
- `timedelta` objects represent durations of time.
- They make it possible to perform arithmetic on dates.

In [None]:
from datetime import timedelta
new_day = some_day + timedelta(days=1)

In [None]:
some_day.strftime("%Y-%m-%d")

In [None]:
new_day.strftime("%Y-%m-%d")

In [None]:
some_day > new_day
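
Arithmetic also works the other way round: subtracting two datetimes yields a `timedelta`. A small sketch with made-up dates:

```python
from datetime import datetime, timedelta

start = datetime(2017, 2, 1, 16, 30)
end = datetime(2017, 2, 3, 18, 0)

elapsed = end - start          # subtracting datetimes gives a timedelta
print(elapsed)
print(elapsed.days, elapsed.seconds)
print(elapsed.total_seconds() / 3600)  # duration in hours

print(start + timedelta(weeks=1, hours=2))
```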

#### Simple time objects
`time.time()` returns the current time as a floating-point number of seconds since the epoch. Subtracting two readings gives the elapsed time, which is how the loop below is timed.

In [None]:
from time import time
t0 = time()
for x in range(1000000):
    a = x
t1 = time()
t1 - t0
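
For measuring short durations, `time.perf_counter()` is usually a better fit, since it uses the highest-resolution clock available and is not affected by system clock adjustments. A sketch of the same timing pattern (the loop body is arbitrary):

```python
from time import perf_counter

t0 = perf_counter()
total = 0
for x in range(100000):
    total += x
t1 = perf_counter()

elapsed = t1 - t0   # seconds, as a float
print("Elapsed: {:.6f} s".format(elapsed))
```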

### Accessing Raw Data

#### Opening and reading files

In [None]:
fname = './data/some_file.txt'
f = open(fname, 'r')
content = f.read()
f.close()
print(content)

In [None]:
fname = './data/some_file.txt'
with open(fname, 'r') as f:
    content = f.read()
print(content)

In [None]:
fname = './data/some_file.txt'
with open(fname, 'r') as f:
    content = f.readlines()
print(content)

In [None]:
fname = './data/some_file.txt'
with open(fname, 'r') as f:
    for line in f:
        print(line)

In [None]:
fname = './data/some_file.txt'
with open(fname, 'r') as f:
    for i, line in enumerate(f):
        print("Line {}: {}".format(i, line.strip()))

#### CSV files (Comma Separated Values)
This format is very common for importing and exporting data from spreadsheets and databases

In [None]:
import csv
fname = './data/data.csv'
with open(fname, 'r') as f:
    data_reader = csv.reader(f, delimiter=',')
    headers = next(data_reader)
    print("Headers = {}".format(headers))
    for line in data_reader:
        print(line)

In [None]:
fname = './data/data_no_header.csv'
with open(fname, 'r') as f:
    data_reader = csv.reader(f, delimiter=',')
    for line in data_reader:
        print(line)

In [None]:
fname = './data/data.csv'
with open(fname, 'r') as f:
    data_reader = csv.reader(f, delimiter=',')
    headers = next(data_reader)
    data = []
    for line in data_reader:
        item = {headers[i]: value for i, value in enumerate(line)}
        data.append(item)
data
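
The header-to-value mapping built by hand above is what `csv.DictReader` does out of the box. The sketch below reads from an in-memory string rather than `./data/data.csv`, so the column names and values are made up.

```python
import csv
import io

# Stand-in for a file on disk; the column names and values are made up
raw = "name,price\norange,20\napple,8\n"

with io.StringIO(raw) as f:
    data = [dict(row) for row in csv.DictReader(f)]

print(data)
# Every value is read as a string; convert explicitly where needed
prices = [int(row['price']) for row in data]
print(prices)
```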

#### JSON (JavaScript Object Notation)
Good for data serialization and communication between services

In [None]:
import json
fname = './data/movie.json'
with open(fname, 'r') as f:
    content = f.read()
    movie = json.loads(content)
movie

In [None]:
import json
fname = './data/movie.json'
with open(fname, 'r') as f:
    movie_alt = json.load(f)

In [None]:
movie == movie_alt

In [None]:
print(json.dumps(movie, indent=4))

In [None]:
import json
fname = './data/movies-90s.jsonl'
with open(fname, 'r') as f:
    for line in f:
        try:
            movie = json.loads(line)
            print(movie['title'])
        except (json.JSONDecodeError, KeyError):
            continue  # skip malformed lines and records without a title

#### Pickles: Python object serialization

In [None]:
with open('./data/movie.json', 'r') as f:
    content = f.read()
    data = json.loads(content)
data

In [None]:
type(data)

In [None]:
import pickle  # the public module; it uses the C implementation when available
with open('./data/data.pickle', 'wb') as f:
    pickle.dump(data, f)
    print("pickle file created")

In [None]:
with open('./data/data.pickle', 'rb') as f:
    data = pickle.load(f)
data

In [None]:
type(data)
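
Unlike JSON, pickle preserves Python-specific types such as tuples, sets and bytes. The sketch below round-trips a made-up object in memory with `dumps()` and `loads()`. Keep in mind that unpickling can execute arbitrary code, so only load pickles from sources you trust.

```python
import pickle

# Made-up data with types that JSON cannot represent directly
data = {'point': (1, 2), 'tags': {'a', 'b'}, 'raw': b'\x00\x01'}

blob = pickle.dumps(data)        # object -> bytes
restored = pickle.loads(blob)    # bytes -> object

print(type(blob))
print(restored == data)
print(type(restored['point']), type(restored['tags']))
```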

### Creating NumPy Arrays
NumPy is the core library for scientific computing in Python

In [None]:
import numpy as np

In [None]:
my_list = [1, 2, 3, 4]
x = np.array(my_list)
x

In [None]:
type(x)

In [None]:
x = np.array([1, 2, 3, 4])
x

#### Multidimensional arrays

In [None]:
x = np.array([[1, 2, 3], [4, 5, 6]])
x

In [None]:
x.size

In [None]:
x.shape

In [None]:
x.ndim

In [None]:
x.dtype

In [None]:
x = np.array([[1, 2, 3], [4, 5, 6]], dtype="float")
x

In [None]:
x.dtype

In [None]:
y = x.astype('int')
y

#### Built-in functions for creating arrays

In [None]:
x = np.zeros((3, 2))
x

In [None]:
x = np.ones((2, 3))
x

In [None]:
x = np.eye(4)
x

In [None]:
x = np.diag([1, 2, 3])
x

In [None]:
x = np.arange(0, 10)
x

In [None]:
x = np.arange(0, 20, 2)
x

In [None]:
x = np.linspace(0, 3, 7)
x

In [None]:
len(x)
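
`arange()` and `linspace()` look similar but are parameterised differently: the first takes a step, the second a number of points. A short comparison, as a sketch:

```python
import numpy as np

# arange: fixed step, stop value excluded
a = np.arange(0, 3, 0.5)
print(a)

# linspace: fixed number of points, stop value included by default
b = np.linspace(0, 3, 7)
print(b)

# The step between linspace points is (stop - start) / (num - 1)
step = (3 - 0) / (7 - 1)
print(step)
```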

In [None]:
[1, 2] * 3

In [None]:
x = np.array([1, 2] * 3)
x

In [None]:
x = np.repeat([1, 2], 3)
x

#### Stacking

In [None]:
ones = np.ones((2, 3))
ones

In [None]:
twos = ones * 2
twos

In [None]:
x = np.vstack([ones, twos])
x

In [None]:
x = np.hstack([ones, twos])
x

### Basic Stats and Linear Algebra

In [None]:
import numpy as np
x = np.array([1, 2, 3])
y = np.array([6, 4, 2])

#### Basic arithmetic operations

In [None]:
x + y

In [None]:
x - y

In [None]:
x * y

In [None]:
x / y

In [None]:
x * 2

In [None]:
y + 1

In [None]:
x / 0  # emits a RuntimeWarning (divide by zero); the result is an array of inf

In [None]:
x.sum()

In [None]:
w = np.array([[1, 2], [3, 4]])
z = np.array([[5, 6], [7, 8]])
z

In [None]:
z.sum()

In [None]:
w + z

In [None]:
w * 2

#### Basic stats

In [None]:
x

In [None]:
x.min()

In [None]:
x.max()

In [None]:
x.mean()

##### Variance

+ $\mu$ is the mean
+ $Var(X) = \frac{1}{n}\sum_{i=1}^{n} (x_{i} - \mu)^2$

In [None]:
x.var()

##### Standard deviation
 
+ $\sigma = \sqrt{Var(X)}$

In [None]:
x.std()
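
To connect the formulas with the NumPy calls, the sketch below computes the variance and standard deviation term by term and compares them with `var()` and `std()`. By default NumPy divides by $n$; passing `ddof=1` divides by $n - 1$ instead.

```python
import numpy as np

x = np.array([1, 2, 3])

mu = x.mean()
variance = ((x - mu) ** 2).sum() / len(x)   # the formula above, term by term
sigma = np.sqrt(variance)

print(variance, x.var())
print(sigma, x.std())

# ddof=1 divides by n - 1 instead of n (sample variance)
print(x.var(ddof=1))
```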

In [None]:
x

In [None]:
x.argmin()

In [None]:
x[x.argmin()] == x.min()

In [None]:
x.argmax()

#### Matrices and linear algebra

In [None]:
A = np.array([[1, 2, 3], [4, 5, 6]])
A

In [None]:
A.shape

In [None]:
A.mean()

In [None]:
A.max()

In [None]:
A.argmax()
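
On a 2-D array these methods work on all elements unless an `axis` is given, and `argmax()` returns a position in the flattened array. A sketch of both points, using `np.unravel_index` to recover the row and column:

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])

print(A.sum(axis=0))   # collapse the rows: one value per column
print(A.sum(axis=1))   # collapse the columns: one value per row
print(A.mean(axis=0))

flat_index = A.argmax()                        # position in the flattened array
row, col = np.unravel_index(flat_index, A.shape)
print(flat_index, row, col)
```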

##### Transpose
+ Switch rows and columns

In [None]:
A

In [None]:
A.shape

In [None]:
B = A.T
B

In [None]:
B.shape

In [None]:
C = np.random.randint(0, 10, (2, 3))
C

In [None]:
C.shape

In [None]:
# A + B  # raises ValueError: shapes (2, 3) and (3, 2) cannot be broadcast together

In [None]:
A + C

##### Dot-product
+ The dot product requires aligned shapes: the number of columns of the first matrix must equal the number of rows of the second. E.g., A.shape = 2x3, B.shape = 3x2, output 2x2

In [None]:
np.dot(A, B)

In [None]:
# np.dot(A, C)  # raises ValueError: shapes (2, 3) and (2, 3) are not aligned (3 != 2)
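
The `@` operator is an equivalent way to write a matrix product. The sketch below also spells out how one entry of the result is computed.

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
B = A.T                                # shape (3, 2)

product = A @ B                        # same result as np.dot(A, B)
print(product)
print(product.shape)

# Entry [0, 0] is the dot product of row 0 of A and column 0 of B
print(A[0] @ B[:, 0])
```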

### Reshaping, Indexing and Slicing

#### Advanced NumPy operations on arrays

##### Reshaping

In [None]:
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
x.resize(2, 3)  # resize() changes x in place
x

In [None]:
x = np.array([1, 2, 3, 4, 5, 6])
y = x.reshape(2, 3)  # reshape() returns a new array object; x keeps its shape
x

In [None]:
y

In [None]:
from random import randint
some_number = randint(1, 3)
x = np.arange(5 * some_number)
x

In [None]:
y = x.reshape(5, -1)
y

In [None]:
y = x.reshape(-1, 5)
y

In [None]:
# y = x.reshape(4, -1)  # raises ValueError: the size (5, 10 or 15) is not divisible by 4

#### Indexing and Slicing

In [None]:
x = np.arange(6)
x

In [None]:
x[2]

In [None]:
x[2:4]

In [None]:
x[2:4] = 100
x

##### Views vs copies

In [None]:
x = np.arange(6)
x_slice = x[2:4]
x_slice

In [None]:
x

In [None]:
x_slice[:] = 100
x

In [None]:
x = np.arange(6)
x_slice = x[2:4].copy()
x_slice[:] = 100
x

In [None]:
x_slice
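
The behaviour above comes from basic slices being views onto the same memory. `np.shares_memory` can confirm this; the names `view` and `copy` below are illustrative.

```python
import numpy as np

x = np.arange(6)
view = x[2:4]           # basic slicing returns a view
copy = x[2:4].copy()    # copy() allocates new memory

print(np.shares_memory(x, view))
print(np.shares_memory(x, copy))

view[:] = 100           # writes through to x
copy[:] = -1            # leaves x untouched
print(x)
```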

##### Two-dimensional indexing and slicing

In [None]:
x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
x

In [None]:
x[0]  # first row

In [None]:
x[0][2]  # third element of the first row

In [None]:
x[0, 2]  # same as above

In [None]:
x[:, 0]  # first column

In [None]:
x[0:2, 1:]

##### Boolean indexing

In [None]:
names = np.array(['Bob', 'Alice', 'Charles', 'Bob', 'Billie', 'Bob'])
data = np.random.randint(0, 10, (6, 3))
data

In [None]:
names == 'Bob'

In [None]:
data[names == 'Bob']
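
Boolean masks can be combined with `|` (or), `&` (and) and `~` (not); each condition needs its own parentheses. In this sketch `data` holds fixed values instead of random ones so the output is reproducible.

```python
import numpy as np

names = np.array(['Bob', 'Alice', 'Charles', 'Bob', 'Billie', 'Bob'])
data = np.arange(18).reshape(6, 3)   # fixed values instead of random ones

bob_or_alice = (names == 'Bob') | (names == 'Alice')
print(data[bob_or_alice])

not_bob = ~(names == 'Bob')          # same as names != 'Bob'
print(data[not_bob])

# A mask can select rows while a slice selects columns
print(data[names == 'Bob', 1:])
```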

### Getting Started with Pandas

#### Series
+ A `pandas.Series` is a one-dimensional array-like object

In [None]:
import pandas as pd
from pandas import Series, DataFrame

data = Series([4, -1, 3, 2])
data

In [None]:
data[0]

In [None]:
data.values

In [None]:
data.index

In [None]:
data = Series([4, -1, 3, 2], index=['a', 'b', 'c', 'd'])
data

In [None]:
data.iloc[0]  # first item by position; with a label index, prefer iloc over data[0]

In [None]:
data['a']

In [None]:
data[['a', 'b']]

In [None]:
data[data > 0]

In [None]:
data * 2

In [None]:
import numpy as np
np.exp(data)

In [None]:
city_data = {'London': 8.6, 'Paris': 2.2, 'Berlin': 3.6}
data = Series(city_data, index=['Berlin', 'London', 'Madrid', 'Paris', 'Rome'])
data

#### Data Frames
+ A `pandas.DataFrame` is a table-like structure

In [None]:
purchases = [{'Customer': 'Bob', 'Item': 'Oranges', 'Quantity': 2, 'Unit price': 2},
             {'Customer': 'Bob', 'Item': 'Apples', 'Quantity': 3, 'Unit price': 1},
             {'Customer': 'Bob', 'Item': 'Milk', 'Quantity': 1, 'Unit price': 4},
             {'Customer': 'Alice', 'Item': 'Oranges', 'Quantity': 2, 'Unit price': 2},
             {'Customer': 'Alice', 'Quantity': 2, 'Unit price': 3}]
df = DataFrame(purchases)
df

##### Accessing rows and columns

In [None]:
df.loc[0]

In [None]:
df['Item']

In [None]:
df.loc[0, 'Item']

In [None]:
df.loc[0:2, ['Item', 'Quantity']]
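
Note that `loc` slices by label and includes both ends, while `iloc` slices by position and excludes the end, like a list. A sketch on a cut-down frame (`by_label` and `by_position` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({'Item': ['Oranges', 'Apples', 'Milk', 'Oranges'],
                   'Quantity': [2, 3, 1, 2]})

by_label = df.loc[0:2]       # labels 0, 1 and 2: three rows
by_position = df.iloc[0:2]   # positions 0 and 1: two rows

print(by_label)
print(by_position)
print(df.iloc[2, 0])         # row position 2, column position 0
```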

##### Boolean indexing

In [None]:
is_alice = df['Customer'] == 'Alice'
is_alice

In [None]:
df[is_alice]

##### Modifying the data frame

In [None]:
df['Total cost'] = df['Unit price'] * df['Quantity']
df

In [None]:
del df['Total cost']
df

In [None]:
df.drop(4)

In [None]:
df

In [None]:
new_df = df.drop(3)
new_df

In [None]:
df.drop(3, inplace=True)
df

In [None]:
df.loc[4, 'Item'] = 'Bananas'
df

### Essential Operations with Data Frames
#### Loading data from files

In [None]:
import pandas as pd

data = pd.read_csv('./data/data.csv')
data

In [None]:
data = pd.read_json('./data/movie.json')
data

In [None]:
data = pd.read_json('./data/movies-90s.jsonl', lines=True)
data

#### Reindexing
+ Reindexing is the process of creating a new object with the data conformed to a new index

In [None]:
data = pd.Series([3, 1, 2], index=['b', 'a', 'd'])
data

In [None]:
new_data = data.reindex(['a', 'b', 'c', 'd'])
new_data
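
Labels that are not in the original index come back as `NaN`. If a default value makes more sense, `reindex()` accepts `fill_value`, as in this sketch:

```python
import pandas as pd

data = pd.Series([3, 1, 2], index=['b', 'a', 'd'])

# Labels missing from the original index become NaN (and the dtype turns to float)
with_nan = data.reindex(['a', 'b', 'c', 'd'])
print(with_nan)

# fill_value supplies a replacement for the missing labels
filled = data.reindex(['a', 'b', 'c', 'd'], fill_value=0)
print(filled)
```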

#### Applying a function

In [None]:
data = pd.DataFrame([[4, 36, 1], [9, 25, 16]],
                    columns=['A', 'B', 'C'],
                    index=['Red', 'Blue'])
data

In [None]:
import numpy as np
np.sqrt(data)

In [None]:
def double_up(x):
    return x * 2
data.applymap(double_up)  # element-wise; newer pandas versions offer DataFrame.map() for this

In [None]:
data

In [None]:
def difference(x):
    return x.max() - x.min()
data.apply(difference)

In [None]:
data.apply(difference, axis=1)

#### Sorting

In [None]:
data

In [None]:
data.sort_index()  # sort by row labels, ascending

In [None]:
data.sort_index(axis=1,           # sort by column labels
                ascending=False)  # descending


data

In [None]:
data.sort_values(by='B')

In [None]:
data.sort_values(by='Blue', axis=1)

#### Handling missing data

In [None]:
data = pd.Series([1, 2, np.nan, 3, np.nan])
data

In [None]:
data == None  # all False: missing values cannot be found with ==

In [None]:
data.isnull()

In [None]:
data.notnull()

##### Filtering out missing data

In [None]:
data.dropna()

In [None]:
data[data.notnull()]

##### Filling in missing data

In [None]:
data.fillna(0)

In [None]:
data.fillna(data.mean())

In [None]:
data.fillna({2: 100, 4: 500})
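
The reason `==` cannot find missing values is that `NaN` is not equal to anything, including itself. The sketch below shows this, plus two things that are easy to overlook: aggregations skip `NaN` by default, and `ffill()` is another way of filling gaps.

```python
import numpy as np
import pandas as pd

print(np.nan == np.nan)   # NaN is not equal to anything, including itself

data = pd.Series([1, 2, np.nan, 3, np.nan])
print(data.isnull().sum())    # number of missing values
print(data.mean())            # NaN is skipped by default
print(data.ffill().tolist())  # carry the last valid value forward
```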

### Summary Stats from a Data Frame

##### Notice
 
+ The methods discussed here return a *new* pandas object and leave the original unchanged
+ Some of them (for example `drop()` and `fillna()`) accept `inplace=True` to modify the object in place instead

In [None]:
import pandas as pd


data = pd.read_csv('./data/store_data.csv')
data

In [None]:
data.head()

In [None]:
data['TOTAL'] = data['QUANTITY'] * data['UNIT PRICE']
data.tail()

In [None]:
data.sum()

In [None]:
data[['QUANTITY', 'TOTAL']].sum()

In [None]:
data['UNIT PRICE'].mean()

In [None]:
data['UNIT PRICE'].fillna(0).mean()

+ `count()`: number of values (excluding NaN)
+ `min(), max()`: compute minimum and maximum value
+ `sum()`: sum of values
+ `mean()`: mean of values
+ `median()`: arithmetic median of values

In [None]:
data['UNIT PRICE'].median()

In [None]:
data['UNIT PRICE'].fillna(0).sort_values()

In [None]:
data['TOTAL'].max()

In [None]:
data['TOTAL'].argmax()  # index location (int)

In [None]:
data['TOTAL'].idxmax()  # index label

In [None]:
data['ITEM'][8]

In [None]:
data.describe()

In [None]:
data['ITEM'].unique()

In [None]:
data['ITEM'].value_counts()

In [None]:
%matplotlib inline

data['QUANTITY'].hist()

### Data Aggregation over a Data Frame

In [None]:
import pandas as pd


data = pd.read_csv('./data/store_data.csv')
data['TOTAL'] = data['QUANTITY'] * data['UNIT PRICE']
data.head()

##### Group By (Aggregation)

In [None]:
data.groupby('CUSTOMER')

In [None]:
data.groupby('CUSTOMER').sum()

In [None]:
data['TOTAL'].groupby(data['CUSTOMER']).sum()

In [None]:
data.groupby('CUSTOMER')['TOTAL'].sum()

##### Example
+ Best-selling items (highest quantity and total revenue)

In [None]:
data.groupby(data['ITEM'])[['QUANTITY', 'TOTAL']].sum().sort_values(by='QUANTITY', ascending=False)
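
When different columns need different aggregations, `agg()` takes a mapping from column name to function(s). The rows below are made up and only reuse the column names of the store data.

```python
import pandas as pd

# Made-up rows with the same column names as the store data
data = pd.DataFrame({
    'ITEM': ['Apples', 'Milk', 'Apples', 'Milk', 'Bread'],
    'QUANTITY': [3, 1, 2, 2, 1],
    'TOTAL': [3.0, 4.0, 2.0, 8.0, 2.5],
})

summary = data.groupby('ITEM').agg({'QUANTITY': 'sum', 'TOTAL': ['sum', 'mean']})
print(summary)
```

The result has two-level column labels, for example `('TOTAL', 'mean')`.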

##### Example
 + Returning customers

In [None]:
data['DATE'].groupby(data['CUSTOMER']).count()  # wrong count: counts purchase rows, not distinct dates

In [None]:
data.head()

In [None]:
data['DATE'].groupby([data['CUSTOMER'], data['DATE']]).count()

In [None]:
data['DATE'].groupby(data['CUSTOMER']).unique()

In [None]:
data['DATE'].groupby(data['CUSTOMER']).unique().apply(len)
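
The same count of distinct dates can be obtained with `nunique()`. The sketch below uses made-up rows to contrast it with `count()` and to list the returning customers.

```python
import pandas as pd

# Made-up rows with the same column names as the store data
data = pd.DataFrame({
    'CUSTOMER': ['Bob', 'Bob', 'Bob', 'Alice'],
    'DATE': ['2017-01-01', '2017-01-01', '2017-01-02', '2017-01-01'],
})

purchases = data.groupby('CUSTOMER')['DATE'].count()    # rows per customer
visits = data.groupby('CUSTOMER')['DATE'].nunique()     # distinct dates per customer
print(purchases)
print(visits)

returning = visits[visits > 1].index.tolist()
print(returning)
```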

### Exploratory Analysis of the Titanic Data Set

In [None]:
import pandas as pd
fname = './data/train.csv'
data = pd.read_csv(fname)

In [None]:
len(data)

In [None]:
data.head()

+ `PassengerId`: serial ID
+ `Survived`: 1=survived, 0=didn't survive
+ `Pclass`: passenger class (1, 2, or 3)
+ `Name`: full name of the passenger
+ `Sex`: male or female
+ `Age`: age in years
+ `SibSp`: # of siblings or spouses aboard the Titanic
+ `Parch`: # of parents or children aboard the Titanic
+ `Ticket`: ticket number
+ `Fare`: passenger fare
+ `Cabin`: cabin number
+ `Embarked`: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

In [None]:
data.count()

In [None]:
data['Age'].min(), data['Age'].max()

In [None]:
data['Survived'].value_counts()

In [None]:
data['Survived'].value_counts() * 100 / len(data)

In [None]:
data['Sex'].value_counts()

In [None]:
data['Pclass'].value_counts()

In [None]:
alpha_color = 0.5
data['Survived'].value_counts().plot(kind='bar')

In [None]:
data['Sex'].value_counts().plot(kind='bar',
                                color=['b', 'r'],
                                alpha=alpha_color)

In [None]:
data['Pclass'].value_counts().sort_index().plot(kind='bar',
                                                alpha=alpha_color)

In [None]:
data.plot(kind='scatter', x='Survived', y='Age')

In [None]:
data[data['Survived'] == 1]['Age'].value_counts().sort_index().plot(kind='bar')

In [None]:
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80]
data['AgeBin'] = pd.cut(data['Age'], bins)
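
By default `pd.cut` builds intervals that are open on the left and closed on the right, so an age of exactly 10 falls in `(0, 10]`, and missing ages stay missing. A sketch with made-up ages:

```python
import pandas as pd

ages = pd.Series([0.5, 10, 10.5, 35, 80, None])   # made-up ages
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80]

binned = pd.cut(ages, bins)   # intervals are open on the left, closed on the right
print(binned)
```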

In [None]:
data[data['Survived'] == 1]['AgeBin'].value_counts().sort_index().plot(kind='bar')

In [None]:
data[data['Survived'] == 0]['AgeBin'].value_counts().sort_index().plot(kind='bar')

In [None]:
data['AgeBin'].value_counts().sort_index().plot(kind='bar')

In [None]:
data[data['Pclass'] == 1]['Survived'].value_counts().plot(kind='bar')

In [None]:
data[data['Pclass'] == 3]['Survived'].value_counts().plot(kind='bar')

In [None]:
data[data['Sex'] == 'male']['Survived'].value_counts().plot(kind='bar')

In [None]:
data[data['Sex'] == 'female']['Survived'].value_counts().plot(kind='bar')

In [None]:
data[(data['Sex'] == 'male') & (data['Pclass'] == 1)]['Survived'].value_counts().plot(kind='bar')

In [None]:
data[(data['Sex'] == 'male') & (data['Pclass'] == 3)]['Survived'].value_counts().plot(kind='bar')

In [None]:
data[(data['Sex'] == 'female') & (data['Pclass'] == 1)]['Survived'].value_counts().plot(kind='bar')

In [None]:
data[(data['Sex'] == 'female') & (data['Pclass'] == 3)]['Survived'].value_counts().plot(kind='bar')
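
The bar charts above can be summarised in a single table: since `Survived` is 0 or 1, its mean per group is the survival rate. The sketch below runs on a tiny made-up sample, not on `train.csv`, so the numbers are not real results.

```python
import pandas as pd

# A tiny made-up sample with the same column names as train.csv
data = pd.DataFrame({
    'Sex': ['male', 'male', 'female', 'female', 'male', 'female'],
    'Pclass': [1, 3, 1, 3, 3, 1],
    'Survived': [1, 0, 1, 0, 0, 1],
})

# The mean of a 0/1 column is the fraction of ones, i.e. the survival rate
rates = data.groupby(['Sex', 'Pclass'])['Survived'].mean()
print(rates)
print(rates.unstack())   # sexes as rows, classes as columns
```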

### Supervised Learning with scikit-learn

In [None]:
import pandas as pd
fname = './data/train.csv'
data = pd.read_csv(fname)

In [None]:
data.head()

##### Using just one feature

In [None]:
data['IsFemale'] = (data['Sex'] == 'female')
samples = data[['IsFemale']]  # X
labels = data['Survived']  # y

##### Train/test split

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(samples,
                                                    labels, 
                                                    train_size=0.7, 
                                                    random_state=0)
print("Samples: train={}, test={}".format(len(X_train), len(X_test)))

In [None]:
X_train['IsFemale'].value_counts()

##### Dummy Classifier (most frequent class)

In [None]:
from sklearn.dummy import DummyClassifier
clf_dummy = DummyClassifier(strategy="most_frequent")
clf_dummy.fit(X_train, y_train)
y_predicted = clf_dummy.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score
print("Accuracy={}".format(accuracy_score(y_test, y_predicted)))

##### Random forest classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)
y_predicted = clf.predict(X_test)
print("Accuracy={}".format(accuracy_score(y_test, y_predicted)))

##### Using more features

In [None]:
samples = data[['IsFemale', 'Pclass']]
labels = data['Survived']
X_train, X_test, y_train, y_test = train_test_split(samples,
                                                    labels, 
                                                    train_size=0.7, 
                                                    random_state=0)

In [None]:
clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)
y_predicted = clf.predict(X_test)
print("Accuracy={}".format(accuracy_score(y_test, y_predicted)))

In [None]:
data['AgeSentinel'] = data['Age'].fillna(-100)

In [None]:
features = ['IsFemale', 'Pclass', 'AgeSentinel']
samples = data[features]
labels = data['Survived']
X_train, X_test, y_train, y_test = train_test_split(samples,
                                                    labels, 
                                                    train_size=0.7, 
                                                    random_state=0)

In [None]:
clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)
y_predicted = clf.predict(X_test)
print("Accuracy={}".format(accuracy_score(y_test, y_predicted)))

In [None]:
features = ['IsFemale', 'Pclass', 'AgeSentinel', 'Fare']
samples = data[features]
labels = data['Survived']
X_train, X_test, y_train, y_test = train_test_split(samples,
                                                    labels, 
                                                    train_size=0.7, 
                                                    random_state=0)

In [None]:
clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)
y_predicted = clf.predict(X_test)
print("Accuracy={}".format(accuracy_score(y_test, y_predicted)))

In [None]:
data['FamilySize'] = data['SibSp'] + data['Parch']
features = ['IsFemale', 'Pclass', 'AgeSentinel', 'Fare', 'FamilySize']
samples = data[features]
labels = data['Survived']
X_train, X_test, y_train, y_test = train_test_split(samples,
                                                    labels, 
                                                    train_size=0.7, 
                                                    random_state=0)

In [None]:
clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)
y_predicted = clf.predict(X_test)
print("Accuracy={}".format(accuracy_score(y_test, y_predicted)))
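
A single train/test split gives one accuracy figure that depends on how the rows happened to be divided. One way to get a more stable estimate is k-fold cross-validation. The sketch below runs on synthetic data from `make_classification` (an assumption, so that it works without `train.csv`), not on the Titanic features; `n_estimators=50` is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the feature matrix and labels
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)   # one accuracy value per fold

print(scores)
print("Mean accuracy={:.3f}, std={:.3f}".format(scores.mean(), scores.std()))
```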

##### Feature importance

In [None]:
import matplotlib.pyplot as plt
plt.bar(range(len(features)), clf.feature_importances_, tick_label=features)
plt.show()

##### What else? (Exercise)

- Different features?
- Different classifiers?