### NumPy

#### NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.

#### Please complete the Pandas tutorial before starting on NumPy.

Datasets Required:
1. exchange-rates-sgd-per-unit-of-usd-daily.csv
2. graduates-from-university-first-degree-courses-by-type-of-course.csv

#### Import pandas and numpy

In [2]:
import pandas as pd
import numpy as np

#### Import CSV file into a dataframe

In [14]:
df = pd.read_csv('exchange-rates-sgd-per-unit-of-usd-daily.csv')
df.describe()

Unnamed: 0,exchange_rate_usd
count,3993.0
mean,1.494808
std,0.197901
min,1.2009
25%,1.3033
50%,1.4622
75%,1.6691
max,2.0503


### Quartiles

#### The first, second and third quartiles of a series are the 25th, 50th and 75th percentiles respectively.

Use np.percentile(SERIES, `x`) to get the `x`th percentile of the distribution.

Find the 25th, 50th and 75th percentiles of exchange_rate_usd.

In [4]:
print(np.percentile(df['exchange_rate_usd'], 25))
print('-------------')
print(np.percentile(df['exchange_rate_usd'], 50))
print('-------------')
print(np.percentile(df['exchange_rate_usd'], 75))


1.3033
-------------
1.4622
-------------
1.6691
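
As a small aside, `np.percentile` also accepts a list of percentiles, so the three quartiles can be computed in one call. The sketch below uses a made-up five-value array (`sample` is not part of the tutorial datasets) so that the results are easy to check by hand.

```python
import numpy as np

# Made-up values for illustration only (not the exchange-rate data).
sample = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Passing a list of percentiles returns one value per requested percentile.
q1, q2, q3 = np.percentile(sample, [25, 50, 75])
print(q1, q2, q3)  # 2.0 3.0 4.0
```

If a column may contain missing values, `np.nanpercentile` takes the same arguments and ignores NaN entries.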


### Interquartile Range

Interquartile range is defined as `third quartile - first quartile`.

This can be used to determine outliers, which are values that lie outside `[ (1st_quartile - 1.5 * IQR) , (3rd_quartile + 1.5 * IQR) ]`

Find the interquartile range of exchange_rate_usd.

In [7]:
iqr = np.percentile(df['exchange_rate_usd'], 75) - np.percentile(df['exchange_rate_usd'], 25)
print(iqr)

0.3658000000000001


Find the lower and upper bound of the exchange_rate_usd for determining outliers. [ lower bound, upper bound ]

In [10]:
lower = np.percentile(df['exchange_rate_usd'], 25) - 1.5 * iqr
upper = np.percentile(df['exchange_rate_usd'], 75) + 1.5 * iqr

print(lower, upper)

0.7545999999999997 2.2178000000000004


Find all days which are outliers.

Hint: Set conditions to identify outliers, and apply the conditions to the dataframe.

In [13]:
df2 = df.copy()

below_lower = df2['exchange_rate_usd'] < lower
above_upper = df2['exchange_rate_usd'] > upper

df2 = df2[below_lower | above_upper]
df2

Unnamed: 0,date,exchange_rate_usd
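
The result above is empty because every exchange rate lies inside the bounds (the minimum 1.2009 is above the lower bound and the maximum 2.0503 is below the upper bound). To see the same rule flag a value, here is a minimal sketch on a made-up array; the helper name `iqr_bounds` and the `sample` values are assumptions for illustration, not part of the tutorial.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) outlier bounds using the k * IQR rule."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up data with one obviously extreme value.
sample = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
low, high = iqr_bounds(sample)

# Boolean masks combined with | select values outside [low, high].
outliers = sample[(sample < low) | (sample > high)]
print(low, high)   # -1.5 8.5
print(outliers)    # [100.]
```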


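### int, float and np.sum()

#### The built-in `int` and `float` can be used to convert datatypes in a dataframe, and `np.sum()` adds up the values. `np.int` and `np.float` were only aliases for these built-ins; they were deprecated in NumPy 1.20 and removed in NumPy 1.24, so use `int` and `float` instead.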

First of all, read graduates-from-university-first-degree-courses-by-type-of-course.csv into a dataframe called df3

In [18]:
df3 = pd.read_csv('graduates-from-university-first-degree-courses-by-type-of-course.csv')
df3.sample(5)

Unnamed: 0,year,sex,type_of_course,no_of_graduates
569,2011,Females,Services,50
164,1998,Males,Services,na
367,2005,Males,"Natural, Physical & Mathematical Sciences",321
347,2004,Females,Humanities & Social Sciences,993
399,2006,Males,Dentistry,18


#### Attempt to change the no_of_graduates column to 'float' type. Note that there will be a value error as there are cells with 'na'

Run the code with .apply(float)

In [23]:
df3['no_of_graduates'].apply(float)

ValueError: could not convert string to float: 'na'

#### To fix this, use the .replace() function in Pandas to replace 'na' values with '0'. This can also be used with 'Nil Return' (NULL) values.

Steps:
1. Make a copy of the df3 to df_new
2. See that df_new has 'na' as one of the values with .unique()
3. .replace() takes 2 parameters: the value to look for ('source') and the value to put in its place ('target')
4. Replace 'source' with 'target' in `no_of_graduates`; in our case, replace 'na' with '0'.
5. Check that df_new no longer has 'na' with .unique()
6. Change the type to float.

In [44]:
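# Uncomment the .unique() lines to inspect the values before and after the replacement

df_new = df3.copy()  # work on a copy so that df3 keeps the original strings
# df_new['no_of_graduates'].unique()
df_new['no_of_graduates'] = df_new['no_of_graduates'].replace('na', '0')
# df_new['no_of_graduates'].unique()
# The built-in float replaces np.float, which was removed in NumPy 1.24
df_new['no_of_graduates'] = df_new['no_of_graduates'].apply(float)
df_new.head()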

Unnamed: 0,year,sex,type_of_course,no_of_graduates
0,1993,Males,Education,0.0
1,1993,Males,Applied Arts,0.0
2,1993,Males,Humanities & Social Sciences,481.0
3,1993,Males,Mass Communication,0.0
4,1993,Males,Accountancy,295.0
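
An alternative that this tutorial does not use is `pd.to_numeric` with `errors='coerce'`, which turns any unparseable string into NaN instead of raising a ValueError. The sketch below runs on a made-up Series (`raw` is hypothetical, not the graduates dataset) and then fills the gaps with 0 to mirror the steps above.

```python
import pandas as pd

# Made-up strings standing in for a column that mixes numbers and 'na'.
raw = pd.Series(['481', 'na', '295'])

# errors='coerce' converts unparseable entries such as 'na' to NaN.
as_float = pd.to_numeric(raw, errors='coerce')

# Fill the missing values with 0, then convert to integers.
filled = as_float.fillna(0).astype(int)
print(filled.tolist())  # [481, 0, 295]
```

Keeping the NaN values (skipping `fillna`) can be preferable when 'na' means "unknown" rather than zero, since pandas sums and means skip NaN by default.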


#### Convert it to int with the built-in int

In [45]:
# The built-in int replaces np.int, which was removed in NumPy 1.24
df_new['no_of_graduates'] = df_new['no_of_graduates'].apply(int)
df_new.head()

Unnamed: 0,year,sex,type_of_course,no_of_graduates
0,1993,Males,Education,0
1,1993,Males,Applied Arts,0
2,1993,Males,Humanities & Social Sciences,481
3,1993,Males,Mass Communication,0
4,1993,Males,Accountancy,295


#### Use np.sum() to sum values in an array. In this case, let's find out the total number of university graduates in Singapore from 1993 to 2014.

##### First, replace 'na' with '0' (done above), create a series for no_of_graduates, then use np.sum(series)

In [50]:
series = df_new['no_of_graduates']
print(np.sum(series))

236992
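
For arrays with more than one dimension, `np.sum` also takes an `axis` argument. The following is a small sketch on a made-up 2 x 3 array (`grid` is hypothetical) showing the total, the column sums and the row sums.

```python
import numpy as np

# Made-up 2 x 3 array for illustration.
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

total = np.sum(grid)               # sum of every element
col_sums = np.sum(grid, axis=0)    # collapse the rows: one sum per column
row_sums = np.sum(grid, axis=1)    # collapse the columns: one sum per row

print(total)     # 21
print(col_sums)  # [5 7 9]
print(row_sums)  # [ 6 15]
```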


### np.arange()

`np.arange(start, stop, step)` creates an array of evenly spaced values from `start` up to, but not including, `stop`.

This can be used in plots to set the range and tick spacing of an axis.

In [51]:
np.arange(1,10,1)

array([1, 2, 3, 4, 5, 6, 7, 8, 9])
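
A few more calls, sketched with made-up ranges, show how the step affects the result and that the stop value is left out. `np.linspace` is included as a related function for cases where the number of points is known and the end value should be included.

```python
import numpy as np

# Step of 2: the stop value 10 is not included.
evens = np.arange(0, 10, 2)
print(evens)     # [0 2 4 6 8]

# A fractional step also works; 1.0 is still excluded.
quarters = np.arange(0, 1, 0.25)
print(quarters)  # [0.   0.25 0.5  0.75]

# np.linspace takes a number of points instead and includes the end value.
points = np.linspace(0, 1, 5)
print(points)    # [0.   0.25 0.5  0.75 1.  ]
```

With step sizes that cannot be represented exactly in floating point (for example 0.1), the length of the `np.arange` result can be surprising, which is one reason `np.linspace` is often preferred for fractional spacing.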