# Week 4

## Introduction to Numpy, Pandas and Scikit-Learn

Alexander Goncearenco

### What are the keystones of Data Analysis?

* Formulating Questions

* Data wrangling: gather, access, clean
(Week 6)

* Exploratory data analysis
(Week 5: visualization; Week 6; Weeks 9-10: unsupervised learning, feature selection)

* Making conclusions and predictions: modeling, machine learning
(Weeks 7-9, 11)

* Reporting and communication (Week 5, Week 13)

### Working with data in Python

Data types and data structures - 
containers to hold, access and modify data efficiently.

Our options:
- Python built-in data types
- Python built-in data structures and functions
- Python packages that extend the built-in capabilities (from the standard library and from pip)
- Third-party tools: we can always run specialized software tools from Python

### Built-in data types, structures and functions

* int, float, complex
* dict, list, set and frozenset, tuple, str, bytes
* https://docs.python.org/3/library/stdtypes.html
* https://docs.python.org/3/library/datatypes.html


### List

In [5]:
a = [1,2,3,4,5]
print(a[0])
print(a[-1])

1
5


In [6]:
print(a[:])
print(a[:3], a[3:])
print(a[0:3], a[3:5])

[1, 2, 3, 4, 5]
[1, 2, 3] [4, 5]
[1, 2, 3] [4, 5]


In [7]:
print(a[slice(0, 3)])
print(type(slice(0, 3)))

[1, 2, 3]
<class 'slice'>


In [8]:
print(a[::2])

[1, 3, 5]


In [9]:
del a[::2]
print(a)

[2, 4]


In [10]:
from math import sin
print([sin(x) for x in range(4)])

[0.0, 0.8414709848078965, 0.9092974268256817, 0.1411200080598672]


### Arrays in Python

* list
* array: 1-dimensional only, see https://docs.python.org/3/library/array.html
* numpy: supports multidimensional arrays

In [11]:
from array import array
from statistics import mean

al = array('l', [1, 2, 3, 4, 5])  # typed arrays store numbers more compactly than lists
print(al)
print(sum(al), mean(al))

array('l', [1, 2, 3, 4, 5])
15 3


In [12]:
ad = array('d', [1.0, 2.0, 3.14])
print(ad)
print(sum(ad), mean(ad))

array('d', [1.0, 2.0, 3.14])
6.140000000000001 2.046666666666667


In [13]:
# However, * and + follow sequence semantics (repeat and concatenate); they are not element-wise:
print(2 * ad)

print(ad + ad)

array('d', [1.0, 2.0, 3.14, 1.0, 2.0, 3.14])
array('d', [1.0, 2.0, 3.14, 1.0, 2.0, 3.14])
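The output above shows `*` and `+` repeating and concatenating the array. A minimal sketch of one way to get element-wise results with `array.array` alone is an explicit comprehension (the names `doubled` and `summed` are only illustrative):

```python
from array import array

ad = array('d', [1.0, 2.0, 3.14])

# Sequence semantics: * repeats the array, it does not scale the values.
repeated = 2 * ad
print(len(repeated))

# Element-wise arithmetic needs an explicit loop or comprehension.
doubled = array('d', [2 * x for x in ad])
summed = array('d', [x + y for x, y in zip(ad, ad)])

print(doubled)
print(summed)
```

This is the gap that numpy arrays fill: there, `2 * a` and `a + a` are element-wise by design.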


### Other numeric issues

In [14]:
0.1 + 0.1 + 0.1 == 0.3

False

In [15]:
from decimal import Decimal

Decimal('0.1') + Decimal('0.1') + Decimal('0.1') == Decimal('0.3')

True
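When exact decimal arithmetic is not needed, a common alternative is to compare floats with a tolerance. A small sketch using `math.isclose` from the standard library with its default tolerances:

```python
from math import isclose

total = 0.1 + 0.1 + 0.1

print(total == 0.3)         # exact comparison is affected by rounding error
print(isclose(total, 0.3))  # comparison within a relative tolerance
```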

### Overview of packages

* __numpy__  - N-dimensional arrays and algebra
* scipy - scientific computing
* __pandas__  - data structures & analysis
* __matplotlib__, seaborn - plotting
* __jupyter__ - notebook, integration with pandas and plotting
* __scikit-learn (sklearn)__  - Machine learning
* statistics - standard package - basic descriptive statistics
* statsmodels - statistical modeling, hypothesis testing

### Datasets:

* https://catalog.data.gov/dataset
* http://mlr.cs.umass.edu/ml/datasets.html
* https://www.kaggle.com/datasets
* https://opendata.socrata.com


__Tabular data__: database tables, Excel, CSV


### Accessing data

* Example dataset: "red wine quality"
* Download CSV from https://archive.ics.uci.edu/ml/datasets/wine+quality

In [16]:
winequality_file = "week4-files/winequality-red.csv"

In [17]:
from itertools import islice

with open(winequality_file) as f:
    for line in islice(f, 0, 5):
        print(line.split(","))

# Still to do: exclude the header and line endings, convert the values to float

FileNotFoundError: [Errno 2] No such file or directory: 'week4-files/winequality-red.csv'
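The remaining steps named in the comment above (skip the header, drop line endings, convert to float) could look roughly like this. The sketch reads from an in-memory string with made-up values instead of the real file, so the column names and numbers are assumptions for illustration only:

```python
import csv
from io import StringIO

# Hypothetical sample standing in for the CSV file; values are made up.
sample = StringIO("fixed acidity,pH,quality\n7.0,3.5,5\n8.0,3.0,6\n")

reader = csv.reader(sample, delimiter=',')
header = next(reader)  # take the header row off first
rows = [[float(value) for value in row] for row in reader]

print(header)
print(rows)
```

`csv.reader` already strips the line endings, so only the header handling and the float conversion are left to do by hand.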

In [None]:
# Python csv.reader: each row is a list of strings
import csv

with open(winequality_file) as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in islice(reader, 0, 5):
        print(', '.join(row))

In [None]:
# Python csv.DictReader: each row is a dict keyed by the header names
import csv

with open(winequality_file) as csvfile:
    reader = csv.DictReader(csvfile, delimiter=',')
    for row in islice(reader, 0, 3):
        print(', '.join(row.values()))
        
print(row.keys())
print(row['pH'])
# Limitation: rows of strings (lists or dicts) are not a convenient data structure for data analysis

In [None]:
# Numpy: read CSV
from numpy import genfromtxt
wine_np = genfromtxt(winequality_file, delimiter=',', skip_header=1)
print(wine_np)
print(wine_np.shape)

In [None]:
# Pandas: read CSV
from pandas import read_csv
wine_df = read_csv(winequality_file, sep=',',header=0)
print(wine_df.shape)
wine_df.head()

In [None]:
# numpy array operations:
print(wine_np[0,0])  # first element
print(wine_np[0,...]) # row
print(wine_np[...,0]) # column

pH = wine_np[...,8]
print(pH.min(), pH.mean(), pH.max())

In [None]:
# filtering
print(wine_np[pH < 3.2, ...])
print(wine_np[pH < 3.2, ...].shape)

In [None]:
import numpy as np

zeros_array = np.zeros((3,4,2))  # a 3 x 4 x 2 array filled with zeros
zeros_array

In [None]:
np.random.rand(3,4,2)

In [None]:
# Seeding the generator makes the random samples below reproducible:
# the same seed always produces the same samples.

np.random.seed?

# In IPython/Jupyter, a trailing ? shows the docstring help for an object

In [None]:
np.random.seed(0)
print(np.random.rand(3))

np.random.seed(0)
print(np.random.rand(3))

np.random.seed(1000)
print(np.random.rand(3))
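The behaviour of the cell above can be checked programmatically rather than by eye. A small sketch (the variable names are illustrative):

```python
import numpy as np

np.random.seed(0)
first = np.random.rand(3)

np.random.seed(0)
second = np.random.rand(3)

np.random.seed(1000)
third = np.random.rand(3)

print(np.array_equal(first, second))  # same seed, same samples
print(np.array_equal(first, third))   # different seed, different samples
```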

In [None]:
wine_np[0:3, 8]  # pH values for the first 3 wines in the array

In [None]:
# changing values
wine_np[0:3, 8] = [3., 3., 3.]
wine_np[0:3, 8]

In [None]:
wine_np.dtype.name

# wine_np.astype(int) returns a converted copy; the original array keeps its dtype
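A short sketch of what `astype` does, using a small made-up array instead of `wine_np`. Note that converting floats to integers truncates the fractional part:

```python
import numpy as np

values = np.array([3.51, 3.2, 2.99])  # made-up values for illustration
as_int = values.astype(int)           # new array; values is left untouched

print(values.dtype.name, as_int.dtype.name)
print(values)
print(as_int)
```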

In [None]:
print(pH)
print(2*pH)
print(pH + pH)
print(np.exp(pH))

In [None]:
M = wine_np[0:2, 0:3]
print(M)
print()
print(M.T)

In [None]:
M.dot(M.T) # matrix multiplication

In [None]:
M.T.dot(M) # matrix multiplication

In [None]:
x = np.array([3,3,3])
M*x  # multiply each row by x element-wise. This is a broadcasting operation

### Broadcasting in numpy:
 - Dimensions are compared pairwise, starting from the last (trailing) dimension of each array.
 - If the two lengths are equal, or one of them is 1, the comparison moves on to the next dimension.
 - If the two lengths differ and neither of them is 1, an error is raised.
 - Checking continues until the array with fewer dimensions runs out of dimensions.
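The rules above can be tried out on small arrays. This is a sketch with made-up values; `M` here only mimics the 2 x 3 shape of the matrix used in the notebook:

```python
import numpy as np

M = np.arange(6.0).reshape(2, 3)  # shape (2, 3), stand-in values
x = np.array([3.0, 3.0, 3.0])     # shape (3,)
col = np.array([[10.0], [20.0]])  # shape (2, 1)

row_scaled = M * x    # trailing lengths 3 and 3 are equal
col_scaled = M * col  # trailing lengths 3 and 1: the 1 is stretched to 3

print(row_scaled.shape, col_scaled.shape)

try:
    M * np.array([1.0, 2.0])  # trailing lengths 3 and 2: neither is 1
except ValueError as err:
    print("broadcast error:", err)
```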

In [None]:
M.dot(x) # matrix multiplication

In [None]:
# but not this:
x.dot(M)  # try to fix it
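One possible way to resolve the shape mismatch above, sketched with a small stand-in matrix (the values are made up; only the shapes match `M` and `x` from the cells above). The inner dimensions of a product must agree, so either `M` is transposed or the vector needs length 2:

```python
import numpy as np

M = np.arange(6.0).reshape(2, 3)  # shape (2, 3), stand-in values
x = np.array([3.0, 3.0, 3.0])     # shape (3,)

right = M.dot(x)   # (2, 3) . (3,)   -> (2,)
left = x.dot(M.T)  # (3,)   . (3, 2) -> (2,)

y = np.array([3.0, 3.0])  # hypothetical length-2 vector
other = y.dot(M)          # (2,) . (2, 3) -> (3,)

print(right, left, other)
```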

In [None]:
print(type(wine_np)) # ndarray object
print(wine_np.shape) # note: shape is an attribute
wine_np.sum()  # sum() is a method

In [None]:
wine_np.sum(axis=0)  # axis 0 is collapsed: one sum per column

In [None]:
wine_np.sum(axis=1)  # axis 1 is collapsed: one sum per row
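The effect of the `axis` argument is easier to see on a tiny array. A sketch with made-up numbers:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])  # shape (2, 3)

total = A.sum()             # all elements
per_column = A.sum(axis=0)  # collapses the rows, shape (3,)
per_row = A.sum(axis=1)     # collapses the columns, shape (2,)

print(total)
print(per_column)
print(per_row)
```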

In [None]:
# pandas DataFrame (similar to a data frame in R)
wine_df.info()

In [19]:
wine_df.dtypes

NameError: name 'wine_df' is not defined

In [None]:
wine_df.describe()

In [None]:
wine_df['pH'].head()

In [None]:
wine_df['pH'].head(5)

In [None]:
wine_df['pH'][:5]

In [None]:
wine_df[:5]

In [None]:
wine_df[:5]['pH']

In [None]:
wine_df[:5][['chlorides', 'pH']]

In [None]:
wine_df[['chlorides', 'pH']][:5]

In [None]:
wine_df['quality'].unique()

In [None]:
wine_df['quality'].nunique()

In [None]:
# a histogram by quality
wine_df['quality'].value_counts()

In [None]:
%matplotlib inline
wine_df['quality'].value_counts().plot(kind='bar')

In [None]:
wine_df['quality'] == 3

# This is a big array of Trues and Falses, one for each row in our dataframe. When we index our dataframe with this array, we get just the rows where the value is True.

In [None]:
# You can also combine more than one condition with the & operator like this:
bad_wine = wine_df['quality'] == 3
acidic_wine = wine_df['pH'] < 3.3

wine_df[bad_wine & acidic_wine]
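Boolean masks can be tested on a small frame first. The sketch below uses a hypothetical four-row DataFrame with made-up values. When conditions are written inline, each one needs its own parentheses because `&` and `|` bind more tightly than comparisons:

```python
import pandas as pd

# Hypothetical miniature of the wine table; values are made up.
df = pd.DataFrame({
    "quality": [3, 5, 3, 6],
    "pH": [3.1, 3.2, 3.5, 3.0],
})

bad = df["quality"] == 3
acidic = df["pH"] < 3.3

both = df[bad & acidic]                                # AND of two masks
either = df[(df["quality"] == 3) | (df["pH"] < 3.1)]   # OR, written inline

print(both)
print(either)
```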

In [None]:
# pandas columns (Series) are typically backed by numpy arrays
import pandas as pd

pd.Series([1,2,3])

In [None]:
pd.Series([1,2,3]).values

In [None]:
np.mean(wine_df['pH'].values)

In [None]:
# group by 
print(wine_df.groupby('quality'))

In [None]:
wine_df.groupby('quality')['alcohol'].mean()
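A sketch of the same group-by pattern on a hypothetical five-row frame (made-up values), including `agg` to compute several statistics per group at once:

```python
import pandas as pd

# Hypothetical data; the column names follow the wine table.
df = pd.DataFrame({
    "quality": [5, 5, 6, 6, 6],
    "alcohol": [9.0, 10.0, 10.0, 11.0, 12.0],
})

means = df.groupby("quality")["alcohol"].mean()
summary = df.groupby("quality")["alcohol"].agg(["count", "mean", "max"])

print(means)
print(summary)
```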

In [None]:
wine_sorted = wine_df.sort_values(['alcohol'], ascending=False)
wine_sorted.head()

In [None]:
wine_sorted.tail()

In [None]:
# loc gets rows (or columns) with particular labels from the index; a label slice includes both ends.
wine_df.loc[544:545]

In [None]:
# iloc gets rows (or columns) at particular positions in the index (so it only takes integers); a position slice excludes the stop.
wine_df.iloc[544:545]

In [None]:
wine_df.loc[:3, 'pH']

In [None]:
wine_df.iloc[:3, 8]
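The difference between label-based and position-based slicing becomes visible when the index does not start at 0. A sketch with a hypothetical frame whose index labels are 10 to 13:

```python
import pandas as pd

# Hypothetical frame; the labels deliberately differ from the positions.
df = pd.DataFrame({"pH": [3.5, 3.2, 3.3, 3.1]}, index=[10, 11, 12, 13])

by_label = df.loc[10:11]     # labels 10 and 11, both ends included
by_position = df.iloc[0:1]   # position 0 only, stop excluded

print(by_label)
print(by_position)
```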

In [None]:
wine_df.at[0, 'pH']

In [None]:
wine_df.iat[0, 8]

In [None]:
wine_df.at[0, 'pH'] = 3.50
wine_df.at[0, 'pH']

In [None]:
wine_df.at[0, 'body'] = 'full'
wine_df.at[1, 'body'] = 'light'
wine_df.head()

In [20]:
print(wine_df['body'].isna().head())
wine_df.loc[wine_df['body'].isna(), 'quality'] = 0
wine_df.head()

NameError: name 'wine_df' is not defined

In [None]:
# transpose
wine_df.T

In [None]:
# Note: index=False drops the row labels, which after transposing are the original column names
wine_df.T.to_csv('transposed.csv', index=False)

In [None]:
pd.merge?

In [None]:
pd.concat?
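A brief sketch of combining tables with `pd.merge` and `pd.concat`, using two hypothetical frames that share an `id` column (the column and its values are assumptions, not part of the wine data):

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "pH": [3.5, 3.2, 3.3]})
right = pd.DataFrame({"id": [2, 3, 4], "quality": [5, 6, 7]})

# merge joins on a key column; an inner join keeps ids present in both frames
merged = pd.merge(left, right, on="id", how="inner")

# concat stacks frames; ignore_index renumbers the rows
stacked = pd.concat([left, left], ignore_index=True)

print(merged)
print(stacked.shape)
```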

<img src="http://scikit-learn.org/stable/_static/ml_map.png" width=700>