# Scientific Computing: Numpy & Pandas

# I. Numpy

![image](./images/numpy.png)


Numpy is a Python package dedicated to manipulating vectors, matrices and higher-dimensional arrays. It also includes many useful mathematical functions.

To import it, just do:

```python
import numpy as np
```


In [1]:
import numpy as np

## I.1 Numpy arrays


- Numpy arrays are the core data structure for numerical computing and data science in Python
- They can be a vector, a matrix or a higher dimensional object
- They can be created using `np.array()`


In [2]:
my_array = np.array([1, 2, 3])
print(my_array)

[1 2 3]


- The shape and type of data can be retrieved with the attributes `.shape` and `.dtype`

In [3]:
print('shape:', my_array.shape)
print('data type:', my_array.dtype)

shape: (3,)
data type: int64


### A bit of vocabulary

- Dimensions are called axes (for a 2-dimensional array, `axis=0`: rows, `axis=1`: columns)
- The number of dimensions can be retrieved with `.ndim`
- The size (`.size`) is the total number of elements in an array
- The shape (`.shape`) is a tuple
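
As a minimal sketch of this vocabulary, the snippet below inspects a small 2-dimensional array. The array `grid` and the variable names are made up for illustration.

```python
import numpy as np

# A 2-dimensional array with 2 rows and 3 columns (illustrative values)
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

print(grid.ndim)   # number of dimensions: 2
print(grid.shape)  # (rows, columns): (2, 3)
print(grid.size)   # total number of elements: 6

# Reducing along axis=0 collapses the rows: one result per column
col_sums = grid.sum(axis=0)  # [5 7 9]
# Reducing along axis=1 collapses the columns: one result per row
row_sums = grid.sum(axis=1)  # [ 6 15]
```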

## I.2 Generating arrays

An array can be generated in many ways. Numpy already provides some useful functions:

In [4]:
np.array([ ], dtype=float)

array([], dtype=float64)

In [5]:
np.zeros(5)

array([0., 0., 0., 0., 0.])

In [6]:
np.ones((5,5))

array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])

In [7]:
np.full((2,3), 10)

array([[10, 10, 10],
       [10, 10, 10]])

In [8]:
np.eye(5)

array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])

### More array generators

- `np.arange(5)`: creates an array of 5 values starting at 0, with a step of 1. The arguments `start`, `stop` (excluded from the result) and `step` can be specified for flexibility.
- `np.linspace`: creates an array with a given number of equidistant values between two numbers (both included by default)

In [9]:
np.arange(0, 10, 1)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [10]:
np.linspace(0, 5, 11)

array([0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. ])

## I.3 Basic mathematical operations

Basic mathematical operations are **element-wise** in numpy:

`*`, `+`, `-`, `/`, `**`, `%`

This works on arrays of the same shape:

In [11]:
my_array = np.arange(1, 6, 2)

In [12]:
print(my_array+my_array)
print(my_array*my_array)
print(my_array/my_array)

[ 2  6 10]
[ 1  9 25]
[1. 1. 1.]


## I.4 Broadcasting

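In some cases, mathematical operations can be executed on arrays of different shapes.

To do so, the shapes have to be **compatible**: comparing them from the last dimension backwards, each pair of dimensions must either be equal or contain a 1: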

In [13]:
arr_id=np.ones((3, 2))
arr_id*np.array([1, 2])

array([[1., 2.],
       [1., 2.],
       [1., 2.]])
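
The following sketch illustrates when broadcasting applies and when it fails. The variable names are chosen for illustration only.

```python
import numpy as np

ones = np.ones((3, 2))

# Shapes (3, 2) and (2,): the trailing dimensions match (2 and 2),
# so the 1-D array is applied to every row.
by_row = ones * np.array([1, 2])

# Shapes (3, 2) and (3, 1): the dimension of size 1 is stretched,
# so each row gets its own factor.
by_col = ones * np.array([[10], [20], [30]])

# Shapes (3, 2) and (3,): the trailing dimensions are 2 and 3 and
# neither is 1, so Numpy raises a ValueError.
try:
    ones * np.array([1, 2, 3])
    incompatible = False
except ValueError:
    incompatible = True

print(by_row)
print(by_col)
print(incompatible)
```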

**Exercise:**

Generate a 5x5 matrix, with values 1, 2, 3, 4, 5 on the diagonal, 0 elsewhere, in **one line of code**.

## I.5 Array manipulation

Array indexing is similar to other languages:
- Get the i-th element of an array: `arr[i]`
- Get several elements: `arr[i:j]` (index `j` is excluded)
- Set values for several elements: `arr[i:j] = 100`
- Set all the values of the array: `arr[:] = 100`
- A slice is a view on the original data; to keep the original values unchanged, work on a copy: `arr.copy()`
- Reverse the order of the elements: `arr[::-1]`
- Swap two values (or two rows of a 2-dimensional array): `arr[[0, 1]] = arr[[1, 0]]`; for two columns, use `arr[:, [0, 1]] = arr[:, [1, 0]]`
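
A short sketch of these indexing operations, and in particular of the difference between a slice and a copy. The arrays used here are illustrative.

```python
import numpy as np

arr = np.arange(5)        # [0 1 2 3 4]

first_three = arr[0:3]    # a slice is a view on the same data
backup = arr.copy()       # an independent copy

arr[0:3] = 100            # also changes first_three, but not backup
print(arr)                # [100 100 100   3   4]
print(first_three)        # [100 100 100]
print(backup)             # [0 1 2 3 4]

reversed_arr = backup[::-1]   # [4 3 2 1 0]

swapped = np.arange(5)
swapped[[0, 1]] = swapped[[1, 0]]   # swap the first two values
print(swapped)                      # [1 0 2 3 4]
```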


**Exercise:**

- Generate an array with values from 0 to 8.
- Display values 1 to 6.
- Set values 6 to 8 to 22 and display the array.

## I.6 2-dimensional arrays

It all remains the same!

The attribute `.shape` is now a two-element tuple. You can use `.reshape()` to change the shape if needed.

To get or set the (i, j)-th element, use `arr[i, j]`.

In [14]:
arr_2dim=np.arange(10).reshape(5,2)
arr_2dim

array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])

In [15]:
arr_2dim[2]=22
arr_2dim[3,1]=33
arr_2dim

array([[ 0,  1],
       [ 2,  3],
       [22, 22],
       [ 6, 33],
       [ 8,  9]])

**Exercise:**
    
Create a 5x5 matrix with values ranging from 0 to 24. Display the 3x3 inside submatrix.

## I.7 Matrix operations on arrays

- Transpose: `.T`
- Matrix product: `np.dot()`

In [16]:
matrix = np.arange(6).reshape(2, 3)
matrix

array([[0, 1, 2],
       [3, 4, 5]])

In [17]:
matrix.T

array([[0, 3],
       [1, 4],
       [2, 5]])

In [18]:
np.dot(matrix, matrix.T)

array([[ 5, 14],
       [14, 50]])
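
As a side note, for 2-dimensional arrays the `@` operator (available since Python 3.5) gives the same result as `np.dot()`, whereas `*` stays element-wise. A minimal sketch reusing the same `matrix`:

```python
import numpy as np

matrix = np.arange(6).reshape(2, 3)

product_dot = np.dot(matrix, matrix.T)  # matrix product, shape (2, 2)
product_at = matrix @ matrix.T          # same result with the @ operator

elementwise = matrix * matrix           # element-wise, shape stays (2, 3)

print(product_at)
print(elementwise)
```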

## I.8 Mathematical methods on arrays

Numpy provides numerous functions to perform mathematical computations on arrays:
- `np.sum()`
- `np.mean()`
- `np.std()`
- `np.median()`
- `np.unique()`

On booleans:
- `np.any()`: `True` if at least one `True`
- `np.all()`: `True` if all `True`

Sort:
- `np.sort()`
- `np.argsort()`
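
The functions listed above can be tried on a small example. The arrays below are made up for illustration; note that most of these functions accept an `axis` argument.

```python
import numpy as np

values = np.array([[3, 1, 2],
                   [6, 5, 4]])

total = np.sum(values)                    # 21, over all elements
col_means = np.mean(values, axis=0)       # [4.5 3.  3. ]
row_medians = np.median(values, axis=1)   # [2. 5.]
distinct = np.unique(np.array([2, 1, 2, 3, 1]))  # [1 2 3], sorted

has_large = np.any(values > 5)     # True: 6 is greater than 5
all_positive = np.all(values > 0)  # True

flat = np.array([30, 10, 20])
sorted_vals = np.sort(flat)        # [10 20 30]
order = np.argsort(flat)           # [1 2 0]: indices that would sort `flat`
```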

# II. Pandas

![image](./images/pandas.png)

Pandas is a very popular library for data handling and manipulation, widely used by data scientists.

Pandas is based on two main objects: **Series** and **DataFrame**.

To import pandas, just type:
```python
import pandas as pd
```

In [19]:
import pandas as pd

## II.1 DataFrame

A DataFrame is a tabular structure with rows and columns. We can create one with the following code:

In [20]:
cities = pd.DataFrame({'France': ['Amiens', 'Belfort', 'Marseille', 'Annecy'],
                       'Germany': ['Munich', 'Berlin', 'Heidelberg', 'Hambourg'],
                       'Spain': ['Madrid', 'Barcelone', 'Seville', 'Valence']})
cities

|   | France | Germany | Spain |
|---|---|---|---|
| 0 | Amiens | Munich | Madrid |
| 1 | Belfort | Berlin | Barcelone |
| 2 | Marseille | Heidelberg | Seville |
| 3 | Annecy | Hambourg | Valence |


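One can get a column with:
- `cities.France`: does not work when the column name contains whitespace, and cannot be used to create a new column
- `cities['France']`: works for any column name, and can also create a column by assignment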

A single column of a DataFrame is called a **Series**.

One can get a value with:
- `cities.loc[0, 'France']`: first the row label (index), then the column name
- `cities.iloc[1, 0]`: the integer positions of the row and of the column

One can drop a column or a row with the method `.drop()`

In [21]:
cities['Germany']

0        Munich
1        Berlin
2    Heidelberg
3      Hambourg
Name: Germany, dtype: object

In [22]:
cities.iloc[1, 0]

'Belfort'
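
Column creation, `.loc` and `.drop()` are described above but not shown. Here is a minimal sketch on a small hypothetical table; the `people` DataFrame and its columns are invented for illustration.

```python
import pandas as pd

# Small hypothetical table, unrelated to the `cities` DataFrame
people = pd.DataFrame({'name': ['Ana', 'Bo', 'Cy'],
                       'age': [31, 25, 40]})

# Create a column with bracket notation
people['age_next_year'] = people['age'] + 1

# Label-based access: row label 1, column 'name'
second_name = people.loc[1, 'name']   # 'Bo'

# .drop() returns a new DataFrame; `people` itself is unchanged
without_age = people.drop(columns=['age'])
without_first_row = people.drop(index=[0])

print(without_age)
print(without_first_row)
```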

## II.2 Data loading with Pandas

One of the most powerful aspects of Pandas is its ability to read many data sources and formats: CSV, SQL, HTML, Excel...

There is usually a dedicated function, following the template `pd.read_[datatype]()`.

For example, let's load a CSV file with `pd.read_csv()`:

In [23]:
df = pd.read_csv('imdb_1000.csv')
df.head()

|   | star_rating | title | content_rating | genre | duration | actors_list |
|---|---|---|---|---|---|---|
| 0 | 9.3 | The Shawshank Redemption | R | Crime | 142 | [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt... |
| 1 | 9.2 | The Godfather | R | Crime | 175 | [u'Marlon Brando', u'Al Pacino', u'James Caan'] |
| 2 | 9.1 | The Godfather: Part II | R | Crime | 200 | [u'Al Pacino', u'Robert De Niro', u'Robert Duv... |
| 3 | 9.0 | The Dark Knight | PG-13 | Action | 152 | [u'Christian Bale', u'Heath Ledger', u'Aaron E... |
| 4 | 8.9 | Pulp Fiction | R | Crime | 154 | [u'John Travolta', u'Uma Thurman', u'Samuel L.... |
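
`pd.read_csv()` accepts many optional arguments. The sketch below shows two of them, `sep` and `usecols`, on hypothetical CSV content held in memory with `io.StringIO`, so that it runs without any file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content using ';' as separator (not the imdb file)
csv_text = "title;duration;year\nMovie A;120;1999\nMovie B;95;2005\n"

# sep sets the column separator, usecols keeps only the listed columns
movies = pd.read_csv(io.StringIO(csv_text), sep=';',
                     usecols=['title', 'duration'])

print(movies)
```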


One can get a lot of information about a dataset using some pandas methods:
- `.head(n)`: display the first `n` rows
- `.info()`: general information about the dataset (columns, types, non-null counts)
- `.describe()`: statistical description of the numeric columns

In [24]:
df.describe()

|   | star_rating | duration |
|---|---|---|
| count | 979.0 | 979.0 |
| mean | 7.889785 | 120.979571 |
| std | 0.336069 | 26.21801 |
| min | 7.4 | 64.0 |
| 25% | 7.6 | 102.0 |
| 50% | 7.8 | 117.0 |
| 75% | 8.1 | 134.0 |
| max | 9.3 | 242.0 |


In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 979 entries, 0 to 978
Data columns (total 6 columns):
star_rating       979 non-null float64
title             979 non-null object
content_rating    976 non-null object
genre             979 non-null object
duration          979 non-null int64
actors_list       979 non-null object
dtypes: float64(1), int64(1), object(4)
memory usage: 46.0+ KB


## II.3 Missing values with pandas

In pandas, missing values are represented as `NaN`, and can be located using the function `pd.isna()`:

In [26]:
pd.isna(pd.Series([2,np.nan,4],index=['a','b','c']))

a    False
b     True
c    False
dtype: bool

You can also fill the missing data with a given value if needed, with the method `.fillna()`, or just drop it with the method `.dropna()`:

In [27]:
pd.Series([2,np.nan,4],index=['a','b','c']).dropna()

a    2.0
c    4.0
dtype: float64
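
For completeness, here is a minimal sketch of `.fillna()` on the same Series. Filling with 0 or with the mean are only two possible choices, picked here for illustration.

```python
import numpy as np
import pandas as pd

series = pd.Series([2, np.nan, 4], index=['a', 'b', 'c'])

filled_zero = series.fillna(0)              # replace NaN with a constant
filled_mean = series.fillna(series.mean())  # replace NaN with the mean of the known values (3.0)
n_missing = series.isna().sum()             # number of missing values: 1

print(filled_zero)
print(filled_mean)
```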

## II.4 Data filtering using pandas

In pandas, data filtering is made easy and intuitive, using boolean conditions.

For example, how can you test, for each movie, whether its genre is `'Action'`?

In [28]:
df['genre'] == 'Action'

0      False
1      False
2      False
3       True
4      False
       ...  
974    False
975    False
976     True
977    False
978    False
Name: genre, Length: 979, dtype: bool

Then how could you keep only the `'Action'` movies?

In [29]:
action_movies = df[df['genre'] == 'Action']
action_movies.head()

|   | star_rating | title | content_rating | genre | duration | actors_list |
|---|---|---|---|---|---|---|
| 3 | 9.0 | The Dark Knight | PG-13 | Action | 152 | [u'Christian Bale', u'Heath Ledger', u'Aaron E... |
| 11 | 8.8 | Inception | PG-13 | Action | 148 | [u'Leonardo DiCaprio', u'Joseph Gordon-Levitt'... |
| 12 | 8.8 | Star Wars: Episode V - The Empire Strikes Back | PG | Action | 124 | [u'Mark Hamill', u'Harrison Ford', u'Carrie Fi... |
| 19 | 8.7 | Star Wars | PG | Action | 121 | [u'Mark Hamill', u'Harrison Ford', u'Carrie Fi... |
| 20 | 8.7 | The Matrix | R | Action | 136 | [u'Keanu Reeves', u'Laurence Fishburne', u'Car... |
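
Boolean conditions can also be combined. The sketch below uses a small hypothetical DataFrame (`films`, not the imdb dataset) to show the operators `&` (and), `|` (or) and `~` (not), as well as `.isin()`.

```python
import pandas as pd

# Hypothetical toy data, not the imdb dataset
films = pd.DataFrame({'title': ['A', 'B', 'C', 'D'],
                      'genre': ['Action', 'Drama', 'Action', 'Comedy'],
                      'duration': [95, 130, 150, 88]})

# Each condition needs its own parentheses when combined with & or |
long_action = films[(films['genre'] == 'Action') & (films['duration'] > 120)]
short_or_drama = films[(films['duration'] < 90) | (films['genre'] == 'Drama')]

# .isin() tests membership in a list of values; ~ negates a condition
drama_or_comedy = films[films['genre'].isin(['Drama', 'Comedy'])]
not_action = films[~(films['genre'] == 'Action')]

print(long_action)
```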


## II.5 Groupby in pandas

Groupby is a powerful method that computes information for every class (group) of a dataset.
For example, to get the mean duration for each movie genre, let's use a groupby:

In [30]:
df.groupby('genre')['duration'].mean()

genre
Action       126.485294
Adventure    134.840000
Animation     96.596774
Biography    131.844156
Comedy       107.602564
Crime        122.298387
Drama        126.539568
Family       107.500000
Fantasy      112.000000
Film-Noir     97.333333
History       66.000000
Horror       102.517241
Mystery      115.625000
Sci-Fi       109.000000
Thriller     114.200000
Western      136.666667
Name: duration, dtype: float64
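
A groupby can also compute several statistics at once. Here is a minimal sketch on hypothetical data; the `films` DataFrame only borrows the column names of the imdb dataset, the values are invented.

```python
import pandas as pd

# Hypothetical toy data reusing the imdb column names
films = pd.DataFrame({'genre': ['Action', 'Drama', 'Action', 'Drama'],
                      'duration': [100, 120, 140, 80],
                      'star_rating': [8.0, 7.5, 9.0, 8.5]})

# Several statistics of one column, per genre
duration_stats = films.groupby('genre')['duration'].agg(['mean', 'max'])

# Mean of several columns, per genre
means = films.groupby('genre')[['duration', 'star_rating']].mean()

print(duration_stats)
print(means)
```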

## II.6 Concatenation in pandas

Finally, one may need to concatenate two (or more) dataframes. This can be done using the function `pd.concat()`:

In [31]:
s1 = pd.DataFrame({'color': ['red', 'orange', 'blue'],
                   'size': ['medium', 'medium', 'small'],
                   'texture': ['hard', 'soft', 'soft']},
                   index=['apple', 'orange', 'blueberry'])


s2 = pd.DataFrame({'color': ['yellow', 'red', 'green'],
                   'size': ['big', 'small', 'small'],
                   'texture': ['hard', 'soft', 'hard']},
                   index=['pineapple', 'raspberry', 'kiwi'])
print(s1)
print(s2)

            color    size texture
apple         red  medium    hard
orange     orange  medium    soft
blueberry    blue   small    soft
            color   size texture
pineapple  yellow    big    hard
raspberry     red  small    soft
kiwi        green  small    hard


In [32]:
pd.concat([s1, s2], axis=0)

|   | color | size | texture |
|---|---|---|---|
| apple | red | medium | hard |
| orange | orange | medium | soft |
| blueberry | blue | small | soft |
| pineapple | yellow | big | hard |
| raspberry | red | small | soft |
| kiwi | green | small | hard |
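
`pd.concat()` can also reset the index or place DataFrames side by side. A short sketch with small hypothetical frames (`top`, `bottom` and `extra` are invented names):

```python
import pandas as pd

top = pd.DataFrame({'color': ['red', 'blue']}, index=['apple', 'blueberry'])
bottom = pd.DataFrame({'color': ['green']}, index=['kiwi'])
extra = pd.DataFrame({'size': ['medium', 'small']}, index=['apple', 'blueberry'])

# axis=0 stacks rows; ignore_index=True replaces the labels with 0, 1, 2, ...
stacked = pd.concat([top, bottom], axis=0, ignore_index=True)

# axis=1 places the frames side by side, aligning rows on the index labels
side_by_side = pd.concat([top, extra], axis=1)

print(stacked)
print(side_by_side)
```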


**Exercise:**

Filter the movies in the imdb dataset with a star rating above 8. Then display the number of movies for each movie genre.

Conclude which movie genre you should produce to get a good rating.