### Importing Pandas Libraries
We import the pandas library and print its version. We also import the NumPy library, because we will use several of its functions.

In [1]:
import pandas as pd
pd.__version__

'0.23.0'

In [2]:
import numpy as np

### Creating Series object 
A pandas Series is a one-dimensional labeled array: a single column of values, each paired with an index label.

In [3]:
series_obj = pd.Series(np.arange(5), index=['row 1', 'row 2','row 3','row 4','row 5'])
series_obj

row 1    0
row 2    1
row 3    2
row 4    3
row 5    4
dtype: int32

**Addressing data elements:**
We can address a data element by 1. its **label index** or 2. its **integer index**.

Square brackets with a label index inside them tell pandas to select and retrieve every record with that label.

Square brackets with an integer index inside them tell pandas to select and retrieve the record at that integer position.

In [4]:
series_obj['row 4']

3

Note that integer positions in a series are zero-indexed, whereas the string labels we assigned to series_obj start at 'row 1'. Position 0 therefore corresponds to the label 'row 1'.

In [5]:
series_obj[[0,3]]

row 1    0
row 4    3
dtype: int32
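Because plain square brackets accept both labels and integer positions, the intent can be ambiguous. A minimal sketch of the explicit accessors, `.loc` for labels and `.iloc` for integer positions, is shown here on a rebuilt copy of the series (the names `s`, `by_label` and `by_position` are only for illustration):

```python
import numpy as np
import pandas as pd

# Rebuild the same series under a throwaway name.
s = pd.Series(np.arange(5), index=['row 1', 'row 2', 'row 3', 'row 4', 'row 5'])

by_label = s.loc['row 4']      # label-based lookup
by_position = s.iloc[[0, 3]]   # position-based lookup, zero-indexed

print(by_label)
print(by_position)
```

Spelling out `.loc` or `.iloc` makes it clear which kind of index is meant, which matters most when the labels themselves are integers.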

### Creating a Dataframe
We can create a dataframe from series objects.

In [6]:
city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 1015785, 485199])

dataframe = pd.DataFrame({ 'City name': city_names, 'Population': population })
dataframe

|   | City name | Population |
|---|---|---|
| 0 | San Francisco | 852469 |
| 1 | San Jose | 1015785 |
| 2 | Sacramento | 485199 |


The dictionary keys we pass alongside the series objects become the column labels. We can also name the rows and columns explicitly by passing one list to the **index** parameter and another to the **columns** parameter.

In [7]:
np.random.seed(42)
DF_obj = pd.DataFrame(np.random.rand(36).reshape((6,6)), 
                   index=['row 1', 'row 2', 'row 3', 'row 4', 'row 5', 'row 6'],
                   columns=['column 1', 'column 2', 'column 3', 'column 4', 'column 5', 'column 6'])
DF_obj

|   | column 1 | column 2 | column 3 | column 4 | column 5 | column 6 |
|---|---|---|---|---|---|---|
| row 1 | 0.37454 | 0.950714 | 0.731994 | 0.598658 | 0.156019 | 0.155995 |
| row 2 | 0.058084 | 0.866176 | 0.601115 | 0.708073 | 0.020584 | 0.96991 |
| row 3 | 0.832443 | 0.212339 | 0.181825 | 0.183405 | 0.304242 | 0.524756 |
| row 4 | 0.431945 | 0.291229 | 0.611853 | 0.139494 | 0.292145 | 0.366362 |
| row 5 | 0.45607 | 0.785176 | 0.199674 | 0.514234 | 0.592415 | 0.04645 |
| row 6 | 0.607545 | 0.170524 | 0.065052 | 0.948886 | 0.965632 | 0.808397 |


**Accessing data elements:** Data elements can be accessed by passing the row and column labels to .loc[] in square brackets. Passing lists of labels selects multiple elements at once. Note: the older .ix[] indexer is deprecated (and was removed entirely in pandas 1.0).

In [8]:
DF_obj.loc[['row 2'], ['column 3']]

|   | column 3 |
|---|---|
| row 2 | 0.601115 |


In [9]:
DF_obj.loc[['row 2', 'row 5'], ['column 5', 'column 2']]

|   | column 5 | column 2 |
|---|---|---|
| row 2 | 0.020584 | 0.866176 |
| row 5 | 0.592415 | 0.785176 |
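The selections above pass lists of labels, so the result is always a dataframe. As a small sketch of the alternatives, the block below rebuilds the same frame (as `df`, an illustrative name) and contrasts scalar labels, lists of labels, and integer positions via `.iloc`:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame(np.random.rand(36).reshape((6, 6)),
                  index=['row %d' % i for i in range(1, 7)],
                  columns=['column %d' % i for i in range(1, 7)])

cell = df.loc['row 2', 'column 3']        # scalar labels -> a single value
block = df.loc[['row 2'], ['column 3']]   # lists of labels -> a 1x1 DataFrame
same_cell = df.iloc[1, 2]                 # integer positions, zero-indexed

print(cell, same_cell)
print(block.shape)
```

Scalar labels are convenient when only the value is needed; lists keep the row and column labels attached to the result.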


### Dealing with missing values
A clean, well-formed dataset is quite rare. Most of the time we get datasets with anomalies or missing data. Pandas has some convenient methods for dealing with missing values.

Here we represent missing values as **NaN**, which is available from NumPy as np.nan.

In [10]:
NaN = np.nan

In [11]:
series_obj = pd.Series(['row 1', 'row 2', NaN, 'row 4','row 5', 'row 6', NaN, 'row 8'])
series_obj

0    row 1
1    row 2
2      NaN
3    row 4
4    row 5
5    row 6
6      NaN
7    row 8
dtype: object

Checking which values are NaN.

In [12]:
series_obj.isnull()

0    False
1    False
2     True
3    False
4    False
5    False
6     True
7    False
dtype: bool
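The boolean result of such a check can also be used as a mask. A minimal sketch, assuming we want to keep only the values that are present (the short series `s` is made up for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series(['row 1', 'row 2', np.nan, 'row 4'])

mask = s.notnull()   # True where a value is present (the inverse of isnull)
present = s[mask]    # boolean indexing keeps only the True positions

print(present)
```

The surviving values keep their original index labels, so gaps in the index show where values were missing.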

Let's create a dataframe and introduce some missing values into it.

In [13]:
np.random.seed(42)
DF_obj = pd.DataFrame(np.random.randn(36).reshape(6,6))
DF_obj.loc[3:5, 0] = NaN
DF_obj.loc[1:4, 5] = NaN
DF_obj

|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0.496714 | -0.138264 | 0.647689 | 1.52303 | -0.234153 | -0.234137 |
| 1 | 1.579213 | 0.767435 | -0.469474 | 0.54256 | -0.463418 | NaN |
| 2 | 0.241962 | -1.91328 | -1.724918 | -0.562288 | -1.012831 | NaN |
| 3 | NaN | -1.412304 | 1.465649 | -0.225776 | 0.067528 | NaN |
| 4 | NaN | 0.110923 | -1.150994 | 0.375698 | -0.600639 | NaN |
| 5 | NaN | 1.852278 | -0.013497 | -1.057711 | 0.822545 | -1.220844 |


### Strategies for filling missing values
There are several strategies for dealing with missing values.

**Strategy 1:**

The .fillna() method finds each missing value in a pandas object and replaces it with the value passed to the method.

In [14]:
filled_NA_DF = DF_obj.fillna(42)  # Fill every NaN with one value; 0 is a common choice
filled_NA_DF

|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0.496714 | -0.138264 | 0.647689 | 1.52303 | -0.234153 | -0.234137 |
| 1 | 1.579213 | 0.767435 | -0.469474 | 0.54256 | -0.463418 | 42.0 |
| 2 | 0.241962 | -1.91328 | -1.724918 | -0.562288 | -1.012831 | 42.0 |
| 3 | 42.0 | -1.412304 | 1.465649 | -0.225776 | 0.067528 | 42.0 |
| 4 | 42.0 | 0.110923 | -1.150994 | 0.375698 | -0.600639 | 42.0 |
| 5 | 42.0 | 1.852278 | -0.013497 | -1.057711 | 0.822545 | -1.220844 |


We can also pass a dictionary whose keys are column labels and whose values are the fill values. Each listed column is filled with its own value.

In [15]:
filled_column_DF = DF_obj.fillna({0: 24, 5: 42})  # Keys are column labels; the default integer labels start at 0
filled_column_DF

|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0.496714 | -0.138264 | 0.647689 | 1.52303 | -0.234153 | -0.234137 |
| 1 | 1.579213 | 0.767435 | -0.469474 | 0.54256 | -0.463418 | 42.0 |
| 2 | 0.241962 | -1.91328 | -1.724918 | -0.562288 | -1.012831 | 42.0 |
| 3 | 24.0 | -1.412304 | 1.465649 | -0.225776 | 0.067528 | 42.0 |
| 4 | 24.0 | 0.110923 | -1.150994 | 0.375698 | -0.600639 | 42.0 |
| 5 | 24.0 | 1.852278 | -0.013497 | -1.057711 | 0.822545 | -1.220844 |
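In the example above every column with missing values happens to be listed in the dictionary. The sketch below uses a small made-up frame (`df`, with illustrative values) to show what happens to a column that is left out: its NaN values stay untouched.

```python
import numpy as np
import pandas as pd

# Three columns with default-style integer labels; values are made up.
df = pd.DataFrame({0: [1.0, np.nan],
                   1: [np.nan, 2.0],
                   2: [np.nan, 3.0]})

# Only columns 0 and 2 are listed, so column 1 keeps its NaN.
filled = df.fillna({0: 24, 2: 42})

print(filled)
```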


**Fill-forward strategy:** Each NaN is replaced by the last valid value above it in the same column.

In [16]:
fill_forward_DF = DF_obj.ffill()  # same result as fillna(method='ffill'), which newer pandas versions deprecate
fill_forward_DF

|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0.496714 | -0.138264 | 0.647689 | 1.52303 | -0.234153 | -0.234137 |
| 1 | 1.579213 | 0.767435 | -0.469474 | 0.54256 | -0.463418 | -0.234137 |
| 2 | 0.241962 | -1.91328 | -1.724918 | -0.562288 | -1.012831 | -0.234137 |
| 3 | 0.241962 | -1.412304 | 1.465649 | -0.225776 | 0.067528 | -0.234137 |
| 4 | 0.241962 | 0.110923 | -1.150994 | 0.375698 | -0.600639 | -0.234137 |
| 5 | 0.241962 | 1.852278 | -0.013497 | -1.057711 | 0.822545 | -1.220844 |
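Forward filling can only copy a value that already appeared earlier in the column. The following sketch, on a small made-up frame (`df` and its values are illustrative), shows that a NaN in the first row has nothing above it and therefore stays NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 1.0, np.nan, np.nan],
                   'b': [5.0, np.nan, 7.0, np.nan]})

# Each NaN takes the last valid value above it in the same column.
filled = df.ffill()

print(filled)
```

If leading gaps matter, a second pass with a constant fill is one possible way to handle them.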


### Number of NaN values
The number of NaN values in each column can be obtained by calling sum() on the result of isnull().

In [17]:
DF_obj.isnull().sum()

0    3
1    0
2    0
3    0
4    0
5    4
dtype: int64
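The same idea extends to rows and to the frame as a whole. As a sketch, the block below rebuilds the frame with missing values (as `df`, an illustrative name) and counts NaN values per column, per row, and in total:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame(np.random.randn(36).reshape(6, 6))
df.loc[3:5, 0] = np.nan   # label slices include both end points
df.loc[1:4, 5] = np.nan

per_column = df.isnull().sum()         # default axis=0: one count per column
per_row = df.isnull().sum(axis=1)      # axis=1: one count per row
total = int(df.isnull().sum().sum())   # summing the column counts gives the total

print(per_row)
print(total)
```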

### Weeding out missing value columns
To drop the columns that contain NaN values, we use the dropna() method and provide the axis value: axis=1 for columns, axis=0 for rows.

In [18]:
DF_drop_col_NaN = DF_obj.dropna(axis=1)
DF_drop_col_NaN

|   | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 0 | -0.138264 | 0.647689 | 1.52303 | -0.234153 |
| 1 | 0.767435 | -0.469474 | 0.54256 | -0.463418 |
| 2 | -1.91328 | -1.724918 | -0.562288 | -1.012831 |
| 3 | -1.412304 | 1.465649 | -0.225776 | 0.067528 |
| 4 | 0.110923 | -1.150994 | 0.375698 | -0.600639 |
| 5 | 1.852278 | -0.013497 | -1.057711 | 0.822545 |


In [19]:
DF_drop_row_NaN = DF_obj.dropna(axis=0)  # rows are dropped by default, even without an axis value
DF_drop_row_NaN

|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0.496714 | -0.138264 | 0.647689 | 1.52303 | -0.234153 | -0.234137 |


Setting every value of a column to NaN.

In [20]:
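all_NA_DF = DF_obj.copy()       # copy, so that DF_obj itself is left unchanged
all_NA_DF.loc[[0, 5], 5] = NaN  # rows 0 and 5 held the only non-NaN values in column 5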
all_NA_DF

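|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0.496714 | -0.138264 | 0.647689 | 1.52303 | -0.234153 | NaN |
| 1 | 1.579213 | 0.767435 | -0.469474 | 0.54256 | -0.463418 | NaN |
| 2 | 0.241962 | -1.91328 | -1.724918 | -0.562288 | -1.012831 | NaN |
| 3 | NaN | -1.412304 | 1.465649 | -0.225776 | 0.067528 | NaN |
| 4 | NaN | 0.110923 | -1.150994 | 0.375698 | -0.600639 | NaN |
| 5 | NaN | 1.852278 | -0.013497 | -1.057711 | 0.822545 | NaN |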


**Dropping a row or a column with all null values**
The dropna() method handles this too: pass how='all' so that only rows or columns made up entirely of NaN are dropped. Since we want to drop a column, set axis=1.

In [21]:
dropped_col_DF = all_NA_DF.dropna(how='all', axis=1)
dropped_col_DF

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 0.496714 | -0.138264 | 0.647689 | 1.52303 | -0.234153 |
| 1 | 1.579213 | 0.767435 | -0.469474 | 0.54256 | -0.463418 |
| 2 | 0.241962 | -1.91328 | -1.724918 | -0.562288 | -1.012831 |
| 3 | NaN | -1.412304 | 1.465649 | -0.225776 | 0.067528 |
| 4 | NaN | 0.110923 | -1.150994 | 0.375698 | -0.600639 |
| 5 | NaN | 1.852278 | -0.013497 | -1.057711 | 0.822545 |
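To make the difference between the two modes concrete, here is a small sketch on a made-up frame (`df` and its values are illustrative) that compares the default how='any' with how='all' when dropping columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],        # partly missing
                   'b': [np.nan, np.nan, np.nan],  # entirely missing
                   'c': [7.0, 8.0, 9.0]})          # complete

any_dropped = df.dropna(axis=1)              # default how='any': drops 'a' and 'b'
all_dropped = df.dropna(axis=1, how='all')   # drops only the all-NaN column 'b'

print(list(any_dropped.columns))
print(list(all_dropped.columns))
```

With how='all', a partly filled column such as 'a' survives, which is why column 0 is still present in the output above.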
