# Session 1 : Discovering Data Analysis libraries

## Jupyter notebook

Jupyter notebooks allow you to create/edit/view documents that contain code snippets and other types of data (like images, tables or even formulas) through a web interface. Code snippets can be executed and the results are displayed in your document.

In our case, every time you run a Python code snippet, the browser sends a request to the server, the server interprets the code, and returns the result to the browser. Select the block below this paragraph and click the "Run" icon in the toolbar.

In [None]:
print("13 * 14 equals", 13 * 14)

Go to the "Help" section in the menu bar, and select "User Interface Tour" to learn more about how to interact with a Jupyter notebook. Once this is done, select the previous code cell again, and add a line of code similar to the existing one to print the result of 14 * 14.

In [None]:
# Cells in Jupyter notebook can either be a code cell or a markdown cell. You can modify the type of a cell with the appropriate item in the toolbar. Markdown is more suited when you need to explain something with text. Modify this cell so it becomes a markdown cell. You might need to do something more so it appears like a regular paragraph.

### IPython

Jupyter notebooks rely on IPython, an enhanced shell compared to the basic Python shell. You can launch IPython on a command line with the command ```ipython```, but you can also use the code cells of a Jupyter notebook.

By default, IPython displays the representation of the last expression of a cell, even if no explicit print was written. This is why a cell that only contains the name of a variable shows its value.

In [None]:
number = 10 # declaration of a variable named `number`

In [None]:
number      # show the string representation of `number`

There is a difference between using the print() function and relying on the default display of IPython, which pretty-prints long or nested objects and is therefore often easier to read than the output of print().

In [None]:
code = {'A': 1, 'B': 2, 'C': 4, 'D': 8, 'E': 16,
        'F': 1, 'G': 2, 'H': 4, 'I': 8, 'J': 16,
        'K': 1, 'L': 2, 'M': 4, 'N': 8, 'O': 16,
        'P': 1, 'Q': 2, 'R': 4, 'S': 8, 'T': 16,
        'U': 1, 'V': 2, 'W': 4, 'X': 8, 'Y': 16,
        'Z': 1
       }
print(code)

In [None]:
code

IPython has built-in TAB completion. Type the first letters of the variable `number` (like `nu`) and then press TAB.

In [None]:
# nu+<TAB>

TAB completion also works for the methods of Python objects. Type `b.` then press TAB.

In [None]:
b = [1, 2, 3, 4, 5, 6, 7, 8]

In [None]:
# b.+<TAB>

IPython can display information and documentation about objects or functions if you add a `?` before/after the name, like `?b` or `b?`. Print the documentation of the object `code` in the cell below.

If you use a double `??` on a function, it will try to display the source code of the function. Print the source code of the function randrange from the random library in the cell below. 

IPython has some special commands known as "magic commands". They are preceded by a `%`. There are many magic commands: you can list them with `%lsmagic`, and then read the documentation of one of them with `%some_command?`. Some of them are :

* `%run path/to/script.py` : run the script.py file and print its output in the notebook. Variables/functions declared in script.py can then be used in the notebook.
* `%timeit` : measure the execution time of a function or a block of code.
* `%reset` : delete all declared variables.

In [None]:
def f(a, b, c):
    return (a ** b) % c

In [None]:
%timeit f(7999, 123, 10000007)

IPython can also run UNIX commands if you precede the name of the command with `!`. Use this syntax to find the current working directory, and to print the date and the calendar of the current month.

TAB completion also works when you use UNIX commands to interact with files or directories on your system. Type `!ls -l /ho` then press TAB to complete the path and list the files under `/home`.

## Numpy

Numpy is a Python library that implements very fast array operations as well as an efficient data structure for multi-dimensional arrays (ndarrays). Numpy is the fundamental library for scientific computing and data analysis, as many other libraries (pandas, scipy...) are built on top of it.

Numpy has many routines written in C/C++/FORTRAN and provides tools to integrate code written in these languages when high performance is required. This means that a program doing numerical computations with Numpy is typically **much faster** than the same program written with pure Python loops.

The first thing in many Numpy programs is to import the library. The name `np` is a convention (and you might find it very often if you look for code samples on the Web).

In [None]:
import numpy as np

### ndarrays : Numpy arrays

The main data type of Numpy is the ndarray. It is a multi-dimensional array of homogeneous data (all elements have the same type). You can create one by passing a list to the `np.array()` function for a 1D array, or a nested list for a multi-dimensional array.

In [None]:
simple_arr = np.array([1.3, 7.9, 11.89, -23.9])
simple_arr

In [None]:
arr = np.array([[1, 2, 3], [4, 5, 6]])
arr

Every ndarray has a `shape` and a data type `dtype`.

In [None]:
print(arr.shape)
print(arr.dtype)

You can also cast a ndarray to change its data type.

In [None]:
arr.astype(np.float32)
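
Note that `astype()` does not modify the array in place: it returns a new array. The following sketch illustrates this (the names `ints` and `floats` are only used for this illustration):

```python
import numpy as np

ints = np.array([[1, 2, 3], [4, 5, 6]])
floats = ints.astype(np.float32)   # new array with the requested dtype

print(floats.dtype)                # float32
print(ints.dtype == floats.dtype)  # False: the original keeps its integer dtype

# casting floats to integers drops the decimal part (truncation toward zero)
truncated = np.array([1.9, -2.7]).astype(np.int64)
print(truncated)
```

If you want to keep the converted array, remember to assign the result to a variable.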

Create a 3-dimensional ndarray `forme` of shape (3, 2, 3) where element at coordinates `(i, j, k)` has the value `(i+j)*k`. Change its type to unsigned 16-bit integer.

Using operators `+, -, /, *, %` or `**` on a ndarray will compute the operation for each element individually.

In [None]:
arr * 5

In [None]:
arr - 20.1 # cast the elements of arr to float before operation

In [None]:
arr ** 0.5

In [None]:
1 / arr

### Creating ndarrays

Numpy has some prebuilt functions to create special kinds of arrays.

In [None]:
# array of ten '1'
np.ones(10)

In [None]:
# multi-dimensional array of '1' (use a tuple as parameter
# to specify the dimensions)
np.ones((2, 3, 2)) 

In [None]:
# multi-dimensional array of '0'
np.zeros((2, 4)) 

In [None]:
# 4x4 identity matrix
np.eye(4) 

In [None]:
# array of integers from 0 to 15
np.arange(16)

In [None]:
# you can transform a 1D array to a matrix with the 
# reshape() method
np.arange(9).reshape((3, 3))

In [None]:
# reshape() can also be used to change the 
# dimensions of an existing matrix
x = np.ones((2, 6))
print(x.shape)
y = x.reshape((3, 4))  # returns a reshaped array, x remains a (2,6) matrix
print(y.shape)
print(x.shape) # same as before

Use the prebuilt functions as well as the operations between matrices and scalars to build a (6, 6) matrix that looks like this:
```
10.2   11.2   12.2   ...  15.2
16.2   17.2   18.2   ...  21.2
  ...
40.2   41.2          ...  45.2
```

### Operations on matrices

You can access the elements of a 1D ndarray in the same way as for a Python list.

In [None]:
a = np.arange(10)
print("a =", a)
print("a[6] =", a[6])
print("a[2:5] =", a[2:5])

When you slice a ndarray and assign it to another variable, you actually create **a view**: the elements are not copied, they are still the elements of the original array. This means that any modifications you do on a view will modify the original array as well.

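Basic slicing, as well as operations such as `reshape()` or the transpose, usually return a view, so be careful when you use them. Arithmetic operations and selections with a boolean condition return a new array instead. Call `.copy()` when you need independent data.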

In [None]:
x = np.arange(10)
print(x)
slice_x = x[3:7]
slice_x[1] = 999
print(x) # x now contains the value 999, even though we only modified slice_x
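
To check whether two arrays share the same data, you can use `np.shares_memory()`. The sketch below compares a view with an explicit copy (the variable names `view` and `copy` are chosen for this illustration only):

```python
import numpy as np

x = np.arange(10)
view = x[3:7]          # basic slicing: shares memory with x
copy = x[3:7].copy()   # explicit copy: independent data

copy[0] = -1           # does not touch x
view[0] = 999          # writes through to x[3]

print(x)
print(np.shares_memory(x, view))  # True
print(np.shares_memory(x, copy))  # False
```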

For multi-dimensional arrays, you can use the Python syntax ``ar[i][j]`` or the simpler version ``ar[i, j]``. In this case, ``i`` is the row index and ``j`` is the column index. Using ``:`` in place of an index selects everything along that axis: ``ar[:, j]`` is the entire column ``j`` and ``ar[i, :]`` is the entire row ``i``.

In [None]:
m = np.arange(20).reshape(4,5)
print(m)
print(m[1][2]) # element at row index 1, column index 2 (2nd row, 3rd column)
print(m[1, 2]) # same
print(m[:,1])  # all rows at column index 1 = entire 2nd column

In [None]:
# use the correct indexes to select the square    2  3
#                                                 5  6
# from the matrix U
U = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(U)

An operation between two arrays of the same shape is computed elementwise.

In [None]:
z = np.array([[1, 2], [3, 4]])
e = np.array([[7, -3], [19, 22]])
print(z)
print(e)
print(z * e)

You can also select rows/columns/elements given a condition.

In [None]:
d = np.random.randn(4,3)
print(d)        # entire matrix

In [None]:
print(d[d > 0]) # only the positive values in d

In [None]:
d[d < 0] = 0    # modify only negative elements in d
print(d)

### Common operations on ndarray

In [None]:
# transpose
arr = np.random.randn(2, 5) # 2 rows, 5 cols
arr.T # 5 rows, 2 cols

In [None]:
# dot product
print(np.dot(z, e))
print(z.dot(e))

In [None]:
# elementwise functions. Take only 1 ndarray in argument
print(np.log(z))
print(np.tanh(e))

In [None]:
# functions between two ndarrays
print(np.add(z, e))     # same as z + e
print(np.minimum(z, e)) # elementwise minimum
print(z > e)            # elementwise comparison

In [None]:
# Let f(x) = exp(sin(x)) and g(x) = sqrt(cosh(x)).
# Create a variable L that contains the maximum value
# between f(x) and g(x) for x in [-5, 5] and a step
# of 0.001. Measure the time needed to create L.

Most of the functions that compute statistics on ndarrays can be global (like computing the mean of all the elements of a matrix) or along one axis (like computing the mean of every row). By default (`axis=None`), the computation is global. With `axis=0` the computation runs down the rows and returns one value per column, while `axis=1` returns one value per row.

In [None]:
m = np.arange(20).reshape(4,5)
print(m)
print(m.mean())        # global mean
print(m.mean(axis=0))  # mean of each column
print(m.mean(axis=1))  # mean of each row
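
A simple way to remember the meaning of `axis` is that it names the dimension that disappears from the result. The sketch below uses `sum()` on the same kind of matrix to make this visible through the shapes (the names `col_sums` and `row_sums` are only illustrative):

```python
import numpy as np

m = np.arange(20).reshape(4, 5)   # 4 rows, 5 columns

col_sums = m.sum(axis=0)  # collapse the rows: one value per column
row_sums = m.sum(axis=1)  # collapse the columns: one value per row

print(col_sums.shape)  # (5,)
print(row_sums.shape)  # (4,)
print(col_sums)
print(row_sums)
```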

In [None]:
# the same thing happens for other methods like sort()
x = np.ceil(np.random.randn(4, 4) * 7 + 1)
print("-- x\n", x)
print("-- sorted cols\n", np.sort(x, axis=0))  # sort columns
print("-- sorted rows\n", np.sort(x, axis=1))  # sort rows

## Matplotlib

Making plots and static/interactive visualizations is **one of the most important tasks in data analysis**. Matplotlib is a Python library to create plots to visualize data. Plots can be displayed directly in a Jupyter notebook or can be saved as JPG, PNG, SVG... for future use. The first thing to do before using it is to import it. The `plt` name is also a convention that you can find on many web resources dealing with Matplotlib.

In [None]:
import matplotlib.pyplot as plt

### First plots

The main function to create a plot is simply `plot()`. You can  pass a list of values `ar` and the library will draw a line joining all points `(i, ar[i])`. You can also pass two lists `x` and `y`: in this case, the line will join the points `(x[i], y[i])`.

In [None]:
plt.plot([3, 7, -2, 2.2, 3])

In [None]:
plt.plot([5, 10, 20, 21], [3.6, 4.2, 3.88, 6.3])

As you can see, the plots are displayed inside the Jupyter notebook, but they are not really interactive. Add the magic command `%matplotlib` right after the line that imports the library, and run the examples again. What are the differences ?

In [None]:
# type your answer here

### Creating plots

Plots reside in a Figure object. It is like a blank canvas where the library will draw the plots. A figure is automatically created if you call the `plot()` method.

In [None]:
fig = plt.figure() # manually create a figure

You cannot directly draw a plot on an empty figure, you need to create one or more subplots. Then you can draw on the subplots.

In [None]:
ax1 = fig.add_subplot(2, 2, 1) # divide the figure in 2 x 2 grid, add axes in position 1
ax2 = fig.add_subplot(2, 2, 2) # divide the figure in 2 x 2 grid, add axes in position 2

In [None]:
ax1.plot(np.arange(5))  # draw a line on the ax1 subplot

In [None]:
# you can create both a figure and all its subplots at the same time
# here we create 6 subplots, 2 rows of 3, and add a plot in the right
# of the bottom row
fig2, axes = plt.subplots(2,3)
axes[1][2].plot(np.random.randn(5))
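
The object returned by `plt.subplots(2, 3)` is a Numpy array of subplots, so everything seen in the Numpy section about indexing applies to it. The sketch below illustrates this. The line `matplotlib.use("Agg")` is an assumption made so that the example also runs outside a notebook (without a display); omit it inside Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in Jupyter
import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(2, 3)
print(type(axes), axes.shape)   # a Numpy array of shape (2, 3)

# axes[1, 2] is the same subplot as axes[1][2]
axes[1, 2].plot(np.arange(5))

# axes.flat iterates over the six subplots, row by row
for i, ax in enumerate(axes.flat):
    ax.set_title(f"subplot {i + 1}")
```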

### Customizing plots

You can change the color and style of the line with keyword arguments. Look at the documentation of `plot()` to see all the available options.

In [None]:
fig, ax = plt.subplots(1, 1)
ax.plot(np.random.randn(10), linestyle='--', color='green', marker='d')
ax.plot(3 * np.random.randn(10), linestyle=':', color='r', marker='o')

# try to add a label argument to both lines. Then call
# ax.legend() to display the legend on the plot.

In [None]:
# plot the following function representations on the same plot:
#  - sin(3*x) in magenta dashed line with hexagon marker
#  - cos(2* pi * x) in blue solid line with square marker

Once you have created a plot, you can add a title, set the positions of the ticks on the X or Y axis, set a label for both axes... Uncomment the lines below one at a time and run the code cell after each change to see the difference.

In [None]:
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.plot(np.random.randn(1000).cumsum())

# set a tick at 0, 250, 500, 750 and 1000 on x-axis
#ax.set_xticks([0, 250, 500, 750, 1000])

# replace numbers on x-axis with a name
#ax.set_xticklabels(['one', 'two', 'three', 'four', 'five'], rotation=30)

# set a name for the x-axis
#ax.set_xlabel('Levels')

# set the title of the plot
#ax.set_title("Evolution of my skill at Tetris")

### Types of plot

The most used type of plot is the line plot, but there are some other types:

* `plt.scatter(x, y)`: draw the points `(x[i], y[i])`, but no line between the points. Useful if you have a bunch of data points and you want to visualize them to get the general trend.
* `plt.hist(X, bins=50)`: draw a histogram of X composed of 50 bins of equal width.

In [None]:
plt.scatter(np.arange(30), np.random.randn(30) * 3 + np.arange(30))

In [None]:
plt.hist(np.random.randn(3000), bins=60)

### Saving figures

You can save a plot with the `savefig()` method. The type of the image (PNG, SVG, JPEG...) is deduced from the filename. See the full documentation for the complete list of available options.

In [None]:
fig, ax = plt.subplots(1, 1)
ax.hist(np.random.randn(3000), bins=60)
fig.savefig('myplot.png', bbox_inches='tight')

## Pandas

Pandas is the main library to carry out data analysis in Python. Pandas is built on top of Numpy and has two main data structures : Series and DataFrame. Both are used to process, manipulate and analyze data. To use Pandas, we import it as follows (`pd` is also a convention in the Python community) :

In [None]:
import pandas as pd
from pandas import Series, DataFrame

### Series

A Series is a 1D data structure composed of an array of data (the values), and an associated array of labels (the index). If no labels are provided, they default to integers (0, 1, 2...) but they can also be dates, strings...

A Series can be seen as a **fixed-length ordered dictionary**: the labels keep the order in which they were given, they are not sorted automatically.

In [None]:
# series with integers as indexes
obj = Series([11, 9, -3, 5]) # indexes are 0, 1, 2...
print(obj)
print(obj.values)
print(obj.index)
print(obj[2]) # get element with index 2

In [None]:
# series with strings as indexes
new_obj = Series([11, 9, -3, 5], index=['a', 'd', 'c', 'r'])
print(new_obj)
print(new_obj.index)
print(new_obj['a'])  # get element with index 'a'
print(new_obj[['c', 'd', 'r']]) # get multiple elements simultaneously

In [None]:
# Series are like dictionaries. You can even create a Series 
# with a dictionary.
d = {"Ohio": 35000, "Texas": 71000, "Utah": 5000}
obj2 = Series(d)
print(obj2)
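
The comparison with dictionaries goes further than the constructor. The following sketch shows a few dictionary-like operations on a Series, and one thing a plain dictionary cannot do. The name `states` is only illustrative, and the preserved key order assumes a recent version of Pandas and Python.

```python
import pandas as pd

states = pd.Series({"Ohio": 35000, "Texas": 71000, "Utah": 5000})

print("Texas" in states)     # membership test works on the labels
print(list(states.index))    # labels keep the order of the dictionary
print(states.to_dict())      # and back to a plain dictionary

# unlike a dictionary, a Series supports vectorized operations
large = states[states > 10000]
print(large)
```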

You can add two Series together. The values that share the same label are summed, and the result is NaN for labels that are present in only one of the Series. This is called **data alignment**.

In [None]:
new_obj2 = Series([2, -6, 8, 1], index=['b', 'r', 'e', 'c'])
new_obj3 = new_obj + new_obj2
print(new_obj3)
# call the method .isnull() on new_obj3. Can you tell what it does ?
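
If you would rather treat a missing label as 0 than get a NaN, the `.add()` method accepts a `fill_value` argument. A minimal sketch, using two small Series named `s1` and `s2` for the illustration:

```python
import pandas as pd

s1 = pd.Series([11, 9, -3, 5], index=["a", "d", "c", "r"])
s2 = pd.Series([2, -6, 8, 1], index=["b", "r", "e", "c"])

total = s1 + s2                     # NaN for labels present in only one Series
filled = s1.add(s2, fill_value=0)   # a missing label counts as 0 instead

print(total)
print(filled)
```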

### DataFrame

A DataFrame (DF) represents tabular data, like a spreadsheet. It contains an ordered collection of columns, and its rows can be labeled with an index. A DataFrame is like a dictionary of Series where each column is a Series.

In [None]:
# the most common way to create a DataFrame is with a dictionary.
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year':  [  2000,   2001,   2002,     2001,     2002],
        'pop':   [   1.5,    1.7,    3.6,      2.4,      2.9],
        'area':  [   116,    116,    116,      286,      286],
       }
frame = DataFrame(data)
frame # DataFrames are displayed nicely within Jupyter notebooks.

In [None]:
# You can set the name of columns and rows when you create a DF.
# If you set a column name that is not in data, all rows will have
# the value NaN for this column.
frame = DataFrame(data, index=['one', 'two', 'three', 'four', 'five'],
                        columns=['state', 'year', 'pop', 'area', 'debt'])
frame

You can extract one column from a DataFrame with a single `[]` and the name of the column, or multiple columns with the double `[[]]` and all the names of the columns you want.

In [None]:
print(frame['year']) # select the year column (1 column)
print(frame[['state', 'pop', 'year']]) # select 3 columns (in that order)

In [None]:
# each column name that is a valid Python identifier becomes an attribute
# (unless it clashes with an existing DataFrame attribute or method)
print(frame.year) # also select the year column
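
Be careful with the attribute syntax: it only works when the column name does not collide with something that already exists on a DataFrame. The column `pop` used in this session is such a case, because `pop()` is also a DataFrame method. A small sketch with an illustrative DataFrame named `df`:

```python
import pandas as pd

df = pd.DataFrame({"year": [2000, 2001], "pop": [1.5, 1.7]})

print(df.year)           # works: `year` is not an existing DataFrame attribute
print(callable(df.pop))  # True: `df.pop` is the DataFrame.pop() method, not the column
print(df["pop"])         # the bracket syntax always selects the column
```

For this reason, the bracket syntax is the safer choice in scripts.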

You can select a row of a DF by position with the `iloc[]` indexer, or by label with the `loc[]` indexer.

In [None]:
print(frame.iloc[1]) # select second row (because DF are zero-indexed)
print(frame.loc["four"]) # select fourth row, because its label is "four"

In [None]:
# select only the columns pop and year of rows 'one' and 'five'

You can set the values of an entire column with a single value (all rows will be set to this value), or with an array (the array must have as many elements as there are rows in the DF). You can also add a new column with the same syntax.

In [None]:
print(frame)

# set all rows to -1
frame['debt'] = -1
print('\n', frame)

# assign a different value for each row
frame['debt'] = [10, 11, 12, 8, 9]
print('\n', frame)

# add a new column mayor
frame['mayor'] = ['Thom', 'Jonny', 'Ed', 'Colin', 'Phil']
print('\n', frame)

Add a column `density` that contains the population density (`pop` divided by `area`) of each row.

You can also select rows/columns with boolean condition, like what we did with Numpy.

In [None]:
# select rows according to value of a column
frame[frame.debt < 11]

### Reading, parsing and cleaning data

Pandas has many built-in methods to load data from different types of sources:
* text files
* databases
* web APIs

We will deal with text files because loading data from databases or Web APIs requires specific packages and is out of the scope of this course.
Let's start with a simple example. Use the `cat` UNIX command to look at the content of file example.csv.

In [None]:
# look at example.csv

Now, read the documentation of the `read_csv()` function and use it to load the file in a DataFrame object. Use the column `message` as the index of the DataFrame.

The function `read_table()` is a closely related function whose default delimiter is a tab instead of a comma. Both functions accept a `sep` argument to specify the delimiter, as well as many more options (parse the dates as date objects, skip some lines...).
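
As an illustration of these options, the sketch below reads a small semicolon-separated text and parses one column as dates. The content, the column names and the variable `raw` are hypothetical; the text is held in memory with `io.StringIO` so that the example does not depend on any file.

```python
import io
import pandas as pd

# hypothetical file content, held in memory instead of on disk
raw = "name;score;day\nalice;12.5;2020-01-03\nbob;9.0;2020-01-04\n"

df = pd.read_csv(io.StringIO(raw), sep=";", parse_dates=["day"])
print(df)
print(df.dtypes)   # `day` is parsed as a datetime column, `score` as float
```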

In [None]:
# load example.csv

Sometimes, reading and loading data is not as easy as reading a csv file. Most of the data available are not structured and we have to rely on other tools to analyze them. Take a look at the content of the `bitly.txt` file.

In [None]:
# look at bitly.txt with the `head` UNIX command

We can see that this file seems to be encoded in JSON format. Let's try to read it. Complete the following cell block.

In [None]:
import json

path = "bitly.txt"

rec = # load all records of bitly.txt

# print the first record of the file
print(rec[0])

Each record is a dictionary object. This means we can create a Series for each record, hence this will give us a DataFrame for the entire data.
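
The records do not need to share the same keys. The sketch below builds a DataFrame from three hypothetical records (they are made up for the illustration and are not taken from `bitly.txt`): every key becomes a column, and the entries that are missing from a record are set to NaN.

```python
import pandas as pd

# hypothetical records: the second one only has the key '_heartbeat_'
records = [
    {"a": "Mozilla/5.0", "tz": "Europe/Paris"},
    {"_heartbeat_": 1331923261},
    {"a": "Opera/9.80", "tz": ""},
]

df = pd.DataFrame(records)
print(df)   # one column per key; missing entries are NaN
```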

In [None]:
frame = DataFrame(rec)
frame

As you can see, there is a column named `_heartbeat_` that was not present when we took a look at the content of the file. Furthermore, some records are almost empty, except for this field. We need to remove them for our analysis.

In [None]:
# only keep records that have an entry for the field 'a'
# clean_records = ...

print("Total records:", len(rec)) # good records + heartbeats
print("Clean records:", len(clean_records)) # no heartbeats

# make a DataFrame composed of only good records
# bitly = ...

bitly[:5]                    # print the first 5 records

### Operations on DataFrame

Most of the Numpy functions can be used on DataFrame.

In [None]:
cities = DataFrame(np.random.randn(4, 3), columns=list('bde'),
                  index=['Paris', 'Lyon', 'Lille', 'Pau'])
cities

In [None]:
np.exp(cities)

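You can apply a custom function to a DataFrame with the `.apply()` method. By default, the function receives each column as a Series (pass `axis=1` to receive the rows instead). A function written with arithmetic operators, like the one below, therefore ends up being applied to every value.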

In [None]:
def f(x):
    return 3*x + 1.7

cities.apply(f)
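
To see that `.apply()` really works column by column (or row by row), you can use a function that reduces a Series to a single number. In this sketch, the function `spread` and the DataFrame `df` are illustrative names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list("bde"),
                  index=["Paris", "Lyon", "Lille", "Pau"])

def spread(values):
    # `values` is a whole column (or a whole row), received as a Series
    return values.max() - values.min()

per_column = df.apply(spread)          # one result per column
per_row = df.apply(spread, axis=1)     # one result per row
print(per_column)
print(per_row)
```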

There are two ways to sort a DataFrame.
* `.sort_index()` : sort the rows according to their respective labels
* `.sort_values(by=...)` : sort the rows according to the value in column specified with argument  `by`. You can pass a list of columns to the by argument if you need to sort with more than one column.

By default, NaN values are placed at the end of a sorted DataFrame/Series, whatever the sort order.
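
The position of the NaN values can be changed with the `na_position` argument of `.sort_values()`. A short sketch on an illustrative Series `s`:

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, np.nan, 1.0, 2.0], index=list("abcd"))

ascending = s.sort_values()                     # NaN last (default)
descending = s.sort_values(ascending=False)     # NaN still last
nan_first = s.sort_values(na_position="first")  # NaN first

print(ascending)
print(descending)
print(nan_first)
```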

In [None]:
d = DataFrame({'b': [4, 7, -3, 2],
               'a': [0, 1, 0, 1],
              }, index=["Venise", "New-York", "Lisbonne", "Tokyo"])
# sort d according to its index labels

In [None]:
# sort d according to the value of a, then the value of b

### Analyzing DataFrame

The `bitly.txt` file contains information about shortened URLs created with the website bit.ly. Every record contains the web browser that was used to click on the bit.ly link, the location of the person who clicked, the URL that was shortened...

We already loaded this file into a DataFrame. Let's see if we can get some valuable information about the location of the users.

In [None]:
# the interesting field is named 'tz'. Extract the series
# related to this field into a variable `timezones` and
# print the first 10 rows.

As you can see, some rows have an empty value for this field. Pandas can deal with missing or incorrect data, but we need to tell it explicitly.

In [None]:
# use a boolean condition to select all rows of timezones
# that are '' and set them to 'Unknown'. Print again the 
# first 10 rows to see the difference

In [None]:
# then use the .value_counts() method on timezones. This will
# count the number of occurrences of each distinct value
# and sort the result. What are the 5 most frequent
# locations ?

In [None]:
# a chart would be nice to see these results. Read the
# documentation of the method .plot() of Series/DataFrame and
# draw the top 10 locations as a horizontal bar graph.

### Apple devices

It would be interesting to know the proportion of Apple devices in each time zone. Combining Numpy and Pandas can provide this information, even if it is not directly present in the data.

The first thing to do is to create a new column `os` (i.e. a Series) in the DataFrame named `bitly`. We will put `Apple` if the entry (i.e. the row) is from an Apple device, and `Other` for all other entries. By looking at the data, I can see that Apple devices have a user agent (the column `a` in bitly) that contains "Mac OS". The function `np.where()` can help us to build such a Series.
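
As a reminder, `np.where(condition, x, y)` returns an array that takes the value `x` where the condition is true and `y` elsewhere. The sketch below applies it to a small numeric array; the data and the names `temperatures` and `labels` are made up for the illustration.

```python
import numpy as np

temperatures = np.array([-3.5, 12.0, 0.0, 25.1])

# one label per element, chosen according to the condition
labels = np.where(temperatures > 0, "above", "below or zero")
print(labels)
```

The condition can be any boolean array or boolean Series of the right length.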

In [None]:
# Add a new column named `os` in bitly that contains:
#   - "Apple" if the user agent contains "Mac OS"
#   - "Other" otherwise
#
# Then print the 5 first values of this column.

Then we can group the rows that have the same timezone and the same OS with `.groupby()`. This method is similar to the `GROUP BY` clause in SQL.

In [None]:
# Look at the documentation of .groupby() method.
# Create a new variable `grouped` that contains the
# information grouped by timezones and operating systems.
# Use the method .size() on `grouped` to get the cardinality
# of each group and assign it to `grouped`. 
# Then print the first 10 items of the variable `grouped`.

We can see that `grouped` has 2 labels for each row (`tz` and `os`). It would be better to keep only 1 label (`tz`) for the rows and to use the other one (`os`) for the columns. We can do this with `.unstack()`.

In [None]:
tz_devices = grouped.unstack()  # now we have 2 more columns : one for Apple, another for Other
tz_devices[:10]
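
To understand what `.unstack()` does, it can help to try it on a tiny Series. In the sketch below, the counts and the labels `city` and `device` are hypothetical. Note that a combination that is absent from the Series appears as NaN in the resulting table.

```python
import pandas as pd

# hypothetical counts indexed by two labels (city, device)
index = pd.MultiIndex.from_tuples(
    [("Lille", "laptop"), ("Lille", "phone"), ("Pau", "phone")],
    names=["city", "device"],
)
counts = pd.Series([4, 7, 2], index=index)

table = counts.unstack()   # the inner label `device` becomes the columns
print(table)               # (Pau, laptop) has no value, so it shows as NaN
```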

In [None]:
# Use the method .fillna() to replace every NaN value
# in `tz_devices` by 0. Print the first 10 rows to verify
# that they have been updated.

In [None]:
# Using the indexes of the 10 most frequent locations,
# select them in `tz_devices`.

In [None]:
# Now, each row represents a location and holds two
# pieces of information:
#   - the number of Apple devices
#   - the number of Other devices
# Use the method .sum() to get the total number of devices
# for each location, and then the method .div() to divide
# each row by the appropriate number so that the rows
# contain:
#   - the percentage of Apple devices for each location
#   - the percentage of Other devices for each location

In [None]:
# Print the percentage results in a chart (select the
# appropriate type of graphic).