### Python libraries

Like physical libraries, Python libraries are collections of reusable resources.

A Python library is a set of ready-made functions that saves you from writing code from scratch.

There are over 137,000 Python libraries available today.

Example:

Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn, etc.

#### What is PIP?

pip is the package manager for Python packages (or modules, if you like).

#### How to Check All Packages Installed on Your Computer

Use either "pip list" or "pip freeze".

Both "pip list" and "pip freeze" generate a list of installed packages, just in different formats. "pip list" prints a table of every package installed in the current environment (regardless of how it was installed).

"pip freeze" prints the packages in the "name==version" format used by requirements files and, by default, leaves out a few packaging tools such as pip itself.
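The same information can also be read from inside Python. The following is a minimal sketch that uses the standard library's "importlib.metadata" module (Python 3.8 or later) rather than pip itself; the variable name "installed" is only illustrative.

```python
from importlib import metadata

# Collect (name, version) pairs for every distribution visible to this interpreter.
installed = sorted(
    (
        (dist.metadata["Name"], dist.version)
        for dist in metadata.distributions()
        if dist.metadata["Name"]  # skip entries with broken metadata
    ),
    key=lambda pair: pair[0].lower(),
)

# Print the first ten in the same "name==version" style that pip freeze uses.
for name, version in installed[:10]:
    print(f"{name}=={version}")
```

The output depends on what is installed in your environment, so it will differ from machine to machine.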

#### How to install a Package

```
pip install <package-name>
```

- Exercise: Install Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn

#### Importing a Library

In [71]:
import pandas as pd    #import pandas
import numpy as np    #import numpy
import matplotlib.pyplot as plt     #import matplotlib's pyplot module
import seaborn as sns

### Series and Dataframes

A Series is a one-dimensional, list-like pandas object that can hold integers, strings, floating-point numbers and more. Unlike a plain list, every value in a Series has an index label; by default the index runs from 0 to n-1, where n is the number of values in the Series.

A Series contains a single list of values with its index, whereas a DataFrame can be made of more than one Series. In other words, a DataFrame is a collection of Series that can be used to analyse the data.

A DataFrame is a two-dimensional object whose columns can have different types.

A DataFrame is the most commonly used pandas object.
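As a small illustration of these definitions, the sketch below builds a Series with the default index, a Series with a custom index, and a DataFrame whose columns have different types. All names and values are made up for the example.

```python
import pandas as pd

# A Series is a one-dimensional labelled array; by default the labels are 0..n-1.
heights = pd.Series([127, 175, 182])
print(list(heights.index))   # [0, 1, 2]

# The index does not have to be integers (these names are invented).
named_heights = pd.Series([127, 175, 182], index=["ann", "bob", "cy"])
print(named_heights["bob"])  # 175

# A DataFrame holds several Series as columns, each with its own dtype.
people = pd.DataFrame({
    "height": [127, 175, 182],          # integers
    "name": ["ann", "bob", "cy"],       # strings
    "shoe_size": [33.0, 44.5, 42.0],    # floats
})
print(people.dtypes)

# Selecting a single column gives back a Series.
print(type(people["height"]))
```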

In [73]:
import pandas as pd   #this is redundant - because we already imported Pandas above
  
author = ['Jitender', 'Purnima', 'Arpit', 'Jyoti']
  
auth_series = pd.Series(author)

auth_series

0    Jitender
1     Purnima
2       Arpit
3       Jyoti
dtype: object

In [47]:
print(type(auth_series))

<class 'pandas.core.series.Series'>


In [76]:
auth_series[0]    #indexing on series objects

'Jitender'

In [53]:
auth_series + auth_series   #vectorised operations possible on series

0    JitenderJitender
1      PurnimaPurnima
2          ArpitArpit
3          JyotiJyoti
dtype: object

In [77]:
#creating a pandas dataframe from 2 series

import pandas as pd    #redundant again - pandas is already imported
  
author = ['Jitender', 'Purnima', 'Arpit', 'Jyoti']
article = [210, 211, 114, 178]
  
auth_series = pd.Series(author)  #generate a series object
article_series = pd.Series(article)  #generate a series object
  
frame = { 'Author': auth_series, 'Article': article_series }
  
result = pd.DataFrame(frame)  #generate a dataframe 
  
result.head()

Unnamed: 0,Author,Article
0,Jitender,210
1,Purnima,211
2,Arpit,114
3,Jyoti,178


In [110]:
#import data

df = pd.read_csv('./data/test.csv')   #imported the data

In [111]:
df.head()    #head shows the first 5 rows of the dataframe

Unnamed: 0,height,sex_no,shoe_size,date_time
0,127,1,33,2020-05-25 08:28:46
1,175,1,44,2020-05-07 22:33:13
2,182,1,42,2020-05-04 09:29:52
3,165,2,41,2020-04-29 22:26:47
4,173,2,41,2020-04-27 22:31:50
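If you do not have the CSV file at hand, the same steps can be tried on an in-memory string. This is a sketch only: "io.StringIO" stands in for the file on disk, the three rows are copied from the output above, and "parse_dates" is an optional "read_csv" argument that the original cell does not use.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk; the rows copy the first lines shown above.
csv_text = """height,sex_no,shoe_size,date_time
127,1,33,2020-05-25 08:28:46
175,1,44,2020-05-07 22:33:13
182,1,42,2020-05-04 09:29:52
"""

# parse_dates converts the text column into real timestamps.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["date_time"])

print(sample.shape)     # (rows, columns) -> (3, 4)
print(sample.dtypes)    # date_time is a datetime column, the rest are integers
print(sample.head(2))   # head(n) shows the first n rows
```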


In [100]:
df.loc[0:3,['sex_no','shoe_size','date_time']]

Unnamed: 0,sex_no,shoe_size,date_time
0,1,33,2020-05-25 08:28:46
1,1,44,2020-05-07 22:33:13
2,1,42,2020-05-04 09:29:52
3,2,41,2020-04-29 22:26:47


In [92]:
df.tail()     #shows the last 5 rows of the dataframe

Unnamed: 0,height,sex_no,shoe_size,date_time
59,162,1,40,2019-11-27 11:47:51
60,175,2,39,2019-11-27 11:42:25
61,162,1,40,2019-11-26 14:10:09
62,156,2,38,2019-11-25 23:27:04
63,152,2,39,2019-11-25 23:18:26


#### loc and iloc

- loc is label-based, which means that we have to specify the names (labels) of the rows and columns that we want to select.

For example, let’s say we ask for the rows whose index is 1, 2 or 100. We will not necessarily get the first, second or hundredth row. Instead, we get results only for rows whose index label is 1, 2 or 100.

So, we can select data with loc even if the index labels in our dataset are not integers.

- On the other hand, iloc is integer position-based. So here, we have to specify rows and columns by their integer position, counting from 0.

Note:

```
loc[row_label, column_label]

```
```
iloc[row_position, column_position]

```
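The difference is easiest to see when the row labels are not 0, 1, 2, and so on. The frame below is hypothetical (labels 10 to 40 are chosen only to make labels and positions disagree); it is a sketch of the behaviour, not part of the dataset used in this notebook.

```python
import pandas as pd

# Hypothetical frame whose row labels are 10, 20, 30, 40 rather than 0, 1, 2, 3.
shoes = pd.DataFrame(
    {"height": [127, 175, 182, 165], "shoe_size": [33, 44, 42, 41]},
    index=[10, 20, 30, 40],
)

print(shoes.loc[20, "height"])   # label 20, column "height" -> 175
print(shoes.iloc[1, 0])          # second row, first column -> 175

# No row is labelled 1, so loc raises KeyError even though a second row exists.
try:
    shoes.loc[1, "height"]
except KeyError:
    print("no row is labelled 1")

# Slices differ: loc includes the end label, iloc excludes the end position.
print(shoes.loc[20:40].index.tolist())    # [20, 30, 40]
print(shoes.iloc[1:3].index.tolist())     # [20, 30]
```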

![Loc_and_iloc](./img/loc_iloc.png)

Example:

In [102]:
df.head()

Unnamed: 0,height,sex_no,shoe_size,date_time
0,127,1,33,2020-05-25 08:28:46
1,175,1,44,2020-05-07 22:33:13
2,182,1,42,2020-05-04 09:29:52
3,165,2,41,2020-04-29 22:26:47
4,173,2,41,2020-04-27 22:31:50


In [104]:
df.loc[0,'sex_no']  #row label 0 and column name 'sex_no'

1

In [105]:
df.iloc[0,1]  #remember python starts counting at 0

1

In [30]:
df.loc[0:2,['height','sex_no']]   #with loc, the end label (2) is included

Unnamed: 0,height,sex_no
0,127,1
1,175,1
2,182,1


In [107]:
df.iloc[0:3,0:2]   #remember: with iloc, the end position of a slice is not included

Unnamed: 0,height,sex_no
0,127,1
1,175,1
2,182,1


Exercise:

1. Import the weather_sheet data
2. Set the Day column as the index column
3. Using Python code, what is the weather like on Saturday?
4. What is the humidity on Wednesday?
5. What is the wind from Friday to Sunday?
6. Can you show only the temperature for Monday to Sunday using loc or iloc?

In [128]:
#Let's add one more series as a column to our dataframe

import pandas as pd
  
author = ['Jitender', 'Purnima', 'Arpit', 'Jyoti']
article = [210, 211, 114, 178]
  
auth_series = pd.Series(author)
article_series = pd.Series(article)
  
frame = { 'Author': auth_series, 'Article': article_series }
  
result = pd.DataFrame(frame)
age = [21, 21, 23]
  
result['Age'] = pd.Series(age)
  
result.head()

Unnamed: 0,Author,Article,Age
0,Jitender,210,21.0
1,Purnima,211,21.0
2,Arpit,114,23.0
3,Jyoti,178,


##### Notice the NaN in the Age column of the last row (shown as an empty cell above): that is the missing value!
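The NaN appears because the three-element Age series is matched to the four-row frame by index, and row 3 has no partner. The sketch below reproduces this on a cut-down, hypothetical frame named "frame" and shows two properties of NaN that matter when searching for it.

```python
import numpy as np
import pandas as pd

# A three-element Series assigned to a four-row frame is aligned on the index,
# so row 3 has no matching value and receives NaN.
frame = pd.DataFrame({"Author": ["Jitender", "Purnima", "Arpit", "Jyoti"]})
frame["Age"] = pd.Series([21, 21, 23])

print(frame["Age"].dtype)             # float64, because NaN is a float
print(frame["Age"].isna().tolist())   # [False, False, False, True]

# NaN never compares equal to anything, not even itself, so use isna() to find it.
print(np.nan == np.nan)               # False
print(pd.isna(np.nan))                # True

# None in a text column is also treated as missing.
print(pd.isna(pd.Series(["a", None])).tolist())   # [False, True]
```

This is also why the ages are displayed as 21.0 and 23.0 rather than 21 and 23: the column was converted to floating point to make room for NaN.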

### Dealing with Missing Values in a Pandas DataFrame

"NaN" is the default missing value marker in pandas, chosen for reasons of computational speed and convenience. We need to be able to detect it easily in data of different types: floating point, integer, boolean, and general object.

#### Finding Missing Values

Pandas provides the "isnull()" and "isna()" functions to detect missing values. "isnull()" is an alias of "isna()", so both do the same thing.

df.isna() returns the dataframe with boolean values indicating missing values.

Example:

In [114]:
result.isna()

Unnamed: 0,Author,Article,Age
0,False,False,False
1,False,False,False
2,False,False,False
3,False,False,True


In [115]:
result.isnull()

Unnamed: 0,Author,Article,Age
0,False,False,False
1,False,False,False
2,False,False,False
3,False,False,True


In [116]:
result.notna()

Unnamed: 0,Author,Article,Age
0,True,True,True
1,True,True,True
2,True,True,True
3,True,True,False


You can also choose to use notna() which is just the opposite of isna().

df.isna().any() returns a boolean value for each column. If there is at least one missing value in that column, the result is True.

df.isna().sum() returns the number of missing values in each column.

In [10]:
result.isna().any()   #returns one boolean per column: True if that column has any missing value

Author     False
Article    False
Age         True
dtype: bool

In [11]:
result.isna().sum()  #returns the number of missing values in each column

Author     0
Article    0
Age        1
dtype: int64
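The same boolean results can be used to pull out the rows that contain a gap. The sketch below rebuilds a frame shaped like "result" so that it runs on its own; "rows_with_gaps" is an illustrative name.

```python
import numpy as np
import pandas as pd

# Rebuild a frame shaped like `result` above.
result = pd.DataFrame({
    "Author": ["Jitender", "Purnima", "Arpit", "Jyoti"],
    "Article": [210, 211, 114, 178],
    "Age": [21, 21, 23, np.nan],
})

# Boolean mask: True for rows where at least one column is missing.
rows_with_gaps = result.isna().any(axis=1)
print(result[rows_with_gaps])           # only the row for Jyoti

# Total number of missing cells in the whole frame.
print(int(result.isna().sum().sum()))   # 1
```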

#### Handling Missing Values

There is no single optimal way to handle missing values.

Depending on the characteristics of the dataset and the task, we can choose to:

- Drop missing values
- Replace missing values

##### Drop missing values

We can drop a row or column with missing values using the "dropna()" function. The how parameter sets the condition for dropping.

- how='any' : drop if there is any missing value
- how='all' : drop if all values are missing

Note: the inplace parameter controls whether the changes are saved in the dataframe itself.

The default value for inplace is False, so "dropna()" returns a new dataframe and leaves the original unchanged. If it is set to True, the original dataframe is modified and nothing is returned.

The axis parameter selects whether rows (0) or columns (1) are dropped.
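To see how these parameters interact, here is a minimal sketch on a hypothetical frame named "data", in which one row and one column are completely empty.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 2 is entirely empty, row 1 has one gap,
# and the "notes" column is entirely empty.
data = pd.DataFrame({
    "a": [1.0, 2.0, np.nan],
    "b": [4.0, np.nan, np.nan],
    "notes": [np.nan, np.nan, np.nan],
})

print(data.dropna(axis=1, how="all"))   # drops only the all-empty "notes" column
print(data.dropna(axis=0, how="all"))   # drops only the all-empty row 2
print(data.dropna(axis=0, how="any"))   # drops every row: each has at least one gap

# dropna returns a new frame by default; the original is unchanged...
print(data.shape)                       # (3, 3)

# ...so assign the result if you want to keep it.
cleaned = data.dropna(axis=1, how="all").dropna(axis=0, how="any")
print(cleaned)                          # one row left: a=1.0, b=4.0
```

Assigning the result to a new name, as in the last step, is usually easier to follow than inplace=True because the original data stays available.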

In [126]:
result.dropna(axis=1, how='all', inplace=False)  #axis specifies whether rows or columns are dropped [0 -- rows, 1 -- columns]

Unnamed: 0,Author,Article,Age
0,Jitender,210,21.0
1,Purnima,211,21.0
2,Arpit,114,23.0


In [124]:
result.dropna(axis=0)  #axis specifies whether rows or columns are dropped [0 -- rows, 1 -- columns]

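The thresh parameter sets the minimum number of non-missing values a row or column must have in order to be kept. Anything with fewer non-missing values is dropped.

For Example:

Setting the thresh parameter to 3 drops every row with fewer than 3 non-missing values. Row 3 has only two (Author and Article), so it is dropped.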

In [25]:
result.dropna(axis=0, thresh=3) 

Unnamed: 0,Author,Article,Age
0,Jitender,210,21.0
1,Purnima,211,21.0
2,Arpit,114,23.0
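Because thresh counts the values that are present rather than the ones that are missing, it can help to print that count first. The frame "scores" below is hypothetical and exists only to show the effect of different thresholds.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the rows have 3, 2 and 1 non-missing values respectively.
scores = pd.DataFrame({
    "x": [1.0, 2.0, 3.0],
    "y": [4.0, 5.0, np.nan],
    "z": [6.0, np.nan, np.nan],
})

print(scores.notna().sum(axis=1).tolist())              # [3, 2, 1]

# thresh=N keeps a row only if it has at least N non-missing values.
print(scores.dropna(axis=0, thresh=3).index.tolist())   # [0]
print(scores.dropna(axis=0, thresh=2).index.tolist())   # [0, 1]
print(scores.dropna(axis=0, thresh=1).index.tolist())   # [0, 1, 2]
```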


##### Replace missing values

The "fillna()" function of Pandas conveniently handles missing values.

Using "fillna()", missing values can be replaced by a special value or an aggregate value such as the mean or median.

In [129]:
result

Unnamed: 0,Author,Article,Age
0,Jitender,210,21.0
1,Purnima,211,21.0
2,Arpit,114,23.0
3,Jyoti,178,


In [130]:
result.fillna(12)

Unnamed: 0,Author,Article,Age
0,Jitender,210,21.0
1,Purnima,211,21.0
2,Arpit,114,23.0
3,Jyoti,178,12.0


In [30]:
mean_age = result.Age.mean()   #column mean 

print(mean_age)

result['Age'].fillna(mean_age)    #dealing with missing values SPECIFICALLY with the Age column

21.666666666666668


0    21.000000
1    21.000000
2    23.000000
3    21.666667
Name: Age, dtype: float64

In [32]:
result.fillna(mean_age)    #dealing with missing values from the COMPLETE dataframe

Unnamed: 0,Author,Article,Age
0,Jitender,210,21.0
1,Purnima,211,21.0
2,Arpit,114,23.0
3,Jyoti,178,21.666667
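A single value passed to "fillna()" on a whole dataframe is used for every column, so a missing Author would also be filled with the mean age. One way around this is to pass a dictionary that maps each column to its own replacement. The frame "staff" and the fill values below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in a text column and a numeric column.
staff = pd.DataFrame({
    "Author": ["Jitender", None, "Arpit", "Jyoti"],
    "Age": [21, 21, 23, np.nan],
})

# A dict maps each column to its own replacement value.
filled = staff.fillna({"Author": "unknown", "Age": staff["Age"].median()})
print(filled)

# fillna returned a new frame; the original still has its gaps.
print(int(staff.isna().sum().sum()))    # 2
print(int(filled.isna().sum().sum()))   # 0
```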


Using the method parameter, missing values can be replaced with the values before or after them.

"ffill" stands for “forward fill” and replaces missing values with the value in the previous row.

You can also choose "bfill", which stands for “backward fill” and uses the value in the next row instead.

Note: If there are many consecutive missing values in a column or row, you may want to use the limit parameter to cap how many consecutive missing values are forward or backward filled.

In [35]:
result.fillna(axis=0,method='ffill', limit=1)

Unnamed: 0,Author,Article,Age
0,Jitender,210,21.0
1,Purnima,211,21.0
2,Arpit,114,23.0
3,Jyoti,178,23.0


In [131]:
result.fillna(axis=0,method='bfill', limit=1)

Unnamed: 0,Author,Article,Age
0,Jitender,210,21.0
1,Purnima,211,21.0
2,Arpit,114,23.0
3,Jyoti,178,
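Newer pandas releases (2.1 and later) deprecate the method argument of "fillna()" in favour of the dedicated "ffill()" and "bfill()" methods, which behave the same way. The sketch below uses those methods on a hypothetical series named "readings" and also shows the effect of limit.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with a run of two gaps in the middle and one at the end.
readings = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# Forward fill copies the last seen value downwards.
print(readings.ffill().tolist())          # [1.0, 1.0, 1.0, 4.0, 4.0]

# limit=1 fills at most one value in each run of consecutive gaps.
print(readings.ffill(limit=1).tolist())   # [1.0, 1.0, nan, 4.0, 4.0]

# Backward fill copies the next value upwards; the last gap has no next value.
print(readings.bfill().tolist())          # [1.0, 4.0, 4.0, 4.0, nan]
```

The last line mirrors the "bfill" output above: the final row stays empty because there is no later row to copy from.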


#### Dealing with Obscure Errors

Real world data is messy. Data cleaning is a major part of every data science project.

In [60]:
df_missing = pd.read_csv('./data/people_info.csv',index_col=0)

df_missing.head()

Unnamed: 0,First Name,Gender,Salary,Bonus %,Senior Management,Team
995,??,,132483,16.655,False,Distribution
996,Phillip,Male,42392,19.675,False,Finance
997,Russell,Unknown,96914,1.421,False,Product
998,Larry,Male,Twnty thousanf,11.985,False,Business Development
999,Albert,Male,129949,10.169,Not True,Sales


In [150]:
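df_missing.loc[995,'First Name'] = 'Oluwakitan'   #replace the '??' placeholder, selecting by label

df_missing.iloc[2,1] = 'Male'   #third row, second column (Gender), selecting by position

df_missing.loc[998,'Salary'] = 0   #replace the misspelled text so the column can be converted to numbers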

In [156]:
df_missing['Salary'].astype(int).mean()

80347.6

In [159]:
df_missing.loc[995,'First Name'] = ''

In [160]:
df_missing.head()

Unnamed: 0,First Name,Gender,Salary,Bonus %,Senior Management,Team
995,,,132483,16.655,False,Distribution
996,Phillip,Male,42392,19.675,False,Finance
997,Russell,Male,96914,1.421,False,Product
998,Larry,Male,0,11.985,False,Business Development
999,Albert,Male,129949,10.169,Not True,Sales


In [133]:
df_missing.isna().sum()

First Name           0
Gender               1
Salary               0
Bonus %              0
Senior Management    0
Team                 0
dtype: int64
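Notice that "isna()" counts only one missing value: an empty string or a placeholder such as '??' is ordinary text as far as pandas is concerned. One possible approach, sketched below on a hypothetical frame modelled on the sample rows, is to turn the placeholders into real NaN values first and to coerce numeric columns with "pd.to_numeric". The list of placeholders is an assumption; it has to be chosen by looking at the data.

```python
import numpy as np
import pandas as pd

# Hypothetical rows modelled on the sample above.
people = pd.DataFrame({
    "First Name": ["??", "Phillip", ""],
    "Gender": [None, "Male", "Unknown"],
    "Salary": ["132483", "42392", "Twnty thousanf"],
})

print(int(people.isna().sum().sum()))   # 1: only the real None is detected

# Turn the placeholder strings into real missing values.
placeholders = ["??", "", "Unknown"]
people = people.mask(people.isin(placeholders))

# Coerce Salary to numbers; anything that cannot be parsed becomes NaN.
people["Salary"] = pd.to_numeric(people["Salary"], errors="coerce")

print(people.isna().sum())              # First Name 2, Gender 2, Salary 1
print(people["Salary"].mean())          # 87437.5, NaN is skipped by mean()
```

When the placeholders are known before loading, the "na_values" argument of "read_csv" can convert them to NaN at import time instead.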

In [None]:
#clean up the data!

Exercise:

1. Open and read the employees.csv dataset
2. Find the sum of all missing values in the dataset
3. Clean the data