## Introduction to NumPy and Pandas 

We will start with NumPy. 

NumPy (Numerical Python) is an open-source Python library. It’s the universal standard for working with numerical data in Python. NumPy is used extensively in Pandas, SciPy, Matplotlib, scikit-learn, scikit-image and most other data science and scientific Python packages.

The NumPy library contains multidimensional array and matrix data structures. It provides ndarray, a homogeneous n-dimensional array object, with methods to **efficiently operate** on it. NumPy can be used to perform a wide variety of mathematical operations on arrays. It adds data structures to Python that make calculations with arrays and matrices efficient, and it supplies a large library of high-level mathematical functions that operate on them.

The above packages, and more, use NumPy arrays. Even though these packages typically support other inputs (lists, dictionaries, rows and/or columns in a dataframe...), they convert such input to NumPy arrays prior to processing, and they often output NumPy arrays.

Thus, in order to efficiently use the packages we will be using throughout the course, a fundamental understanding of NumPy arrays is important.

Helpful tutorials:

https://numpy.org/learn/

https://towardsdatascience.com/the-ultimate-beginners-guide-to-numpy-f5a2f99aef54

In [123]:
# importing the library

import numpy as np

### Creating an array

**What is an array?**

An array is a homogeneous, fixed-size collection of items.
* Homogeneous: all items are of the same data type.
* Fixed-size: once initialized with a certain number of rows/columns, its size cannot be changed. The workaround is to create a new array that holds the changed contents of the old one.
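
The fixed-size point can be checked with a minimal sketch. The variable names are made up for illustration; `np.append` is used here only as one example of a function that returns a new array instead of resizing the old one.

```python
import numpy as np

# "Growing" an array actually builds a new array.
original = np.array([1, 2, 3])
extended = np.append(original, 4)  # returns a new array; `original` is untouched

print(original)  # [1 2 3]
print(extended)  # [1 2 3 4]
print(np.shares_memory(original, extended))  # False: two separate buffers
```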

A helpful guide to visualizing arrays: https://betterprogramming.pub/numpy-illustrated-the-visual-guide-to-numpy-3b1d4976de1d 

In [124]:
# Creating our first array

a = np.array([1,2,3], dtype="int32")
print(a)

print("This array has", a.ndim, "dimension")
# A vector: 1D array


print("This array is of size", a.size) # Number of elements

[1 2 3]
This array has 1 dimension
This array is of size 3


In [125]:
b = np.array([[9.0,8.0,7.0],[1.0,2.0,3.0]])
print(b)

print("This array has", b.ndim, "dimensions")
# A matrix: 2D array

print("This array has shape of", b.shape)
print("This array is of size", b.size) # Number of elements

[[9. 8. 7.]
 [1. 2. 3.]]
This array has 2 dimensions
This array has shape of (2, 3)
This array is of size 6


NumPy’s array class is called ndarray. 

ndarray.ndim will tell you the number of axes, or dimensions, of the array.

ndarray.size will tell you the total number of elements of the array. This is the product of the elements of the array’s shape.

ndarray.shape will display a tuple of integers that indicate the number of elements stored along each dimension of the array.

Note that the NumPy array tries to conform all elements to the same data type.

You can specify the data type of the array elements using the dtype parameter (play around with the **a** ndarray).
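
As a small sketch of how this works (illustrative names, not part of the cells in this notebook): NumPy picks a common type when the inputs are mixed, and `dtype` or `astype` lets you choose one yourself.

```python
import numpy as np

mixed = np.array([1, 2, 3.5])                # ints and a float are upcast to float64
print(mixed.dtype)                           # float64

as_int = np.array([1, 2, 3], dtype="int32")  # explicit dtype at creation
print(as_int.dtype)                          # int32

converted = mixed.astype("int32")            # conversion truncates: 3.5 becomes 3
print(converted)                             # [1 2 3]
```

Note that `astype` returns a new array; `mixed` itself keeps its float64 values.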

In [126]:
# Conforming all elements to float: mixing ints and floats upcasts everything to float.

c = np.array([[9,8,7.0],[1.0,2.0,3.0]])

print(c)

# Try bypassing that by using the dtype parameter.

[[9. 8. 7.]
 [1. 2. 3.]]


In [127]:
a = np.array([[1,2,3,4,5,6,7],[8,9,10,11,12,13,14]])

print(a)
print(a.shape)

[[ 1  2  3  4  5  6  7]
 [ 8  9 10 11 12 13 14]]
(2, 7)


In [128]:
# Getting a specific element [r,c]
# Remember indexing starts at 0

a[0,5] #First row, sixth column

6

In [129]:
# Get a specific row 
a[0, :]

array([1, 2, 3, 4, 5, 6, 7])

In [130]:
# Get a specific column
a[:, 2]

array([ 3, 10])

In [131]:
# Remember indexing parameters [startindex:endindex:stepsize]
a[0, 1:-1:2] #first row, column parameters

array([2, 4, 6])

In [132]:
# Changing an array element

a[0,5] = 10

a[0,5]

10

In [133]:
# You can also build n-dimension arrays

b = np.array([[[1,2],[3,4]],[[5,6],[7,8]]])
print(b) 

# An array that has 2-D arrays (matrices) as its elements is called a 3-D array.

[[[1 2]
  [3 4]]

 [[5 6]
  [7 8]]]


In [134]:
# Get specific element (works outside in with 3D arrays)
b[0,1,1] #The first set, second row, second column

4

### Initializing Different Types of Arrays

In [135]:
# All 0s matrix
np.zeros((2,3))

array([[0., 0., 0.],
       [0., 0., 0.]])

In [136]:
# All 1s matrix
np.ones((2,3), dtype ='int32')

array([[1, 1, 1],
       [1, 1, 1]])

In [137]:
# Any other number
np.full((2,2), 100)

array([[100, 100],
       [100, 100]])

In [138]:
# Random decimal numbers
np.random.rand(4,2)

array([[0.50715212, 0.26001318],
       [0.5367014 , 0.95586195],
       [0.09880474, 0.05809008],
       [0.21444974, 0.64236425]])

In [139]:
# Random Integer values
np.random.randint(-4,8, size=(3,3)) # integers from -4 (inclusive) to 8 (exclusive), i.e. -4 to 7

array([[ 7,  6,  5],
       [-4, -3,  1],
       [-4,  2,  6]])

In [140]:
# Identity matrix
np.identity(5) # 5 x 5 matrix

array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])

### Array functions 
To learn more about the most common array functions, see:
https://numpy.org/doc/stable/reference/routines.html and https://numpy.org/doc/stable/user/index.html

With arrays, you can add and delete elements (both operations create a new array), transpose, split, join/merge, and reshape. And more. Reshaping only works if the old and new shapes hold the same number of elements. The number of elements is the product of the dimensions in the shape; for a 2D array, that is rows * columns.
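
A short sketch of these operations, assuming a small made-up 2 x 4 array called `grid` (the other names are illustrative too):

```python
import numpy as np

grid = np.arange(1, 9).reshape((2, 4))   # 8 elements arranged as 2 x 4
print(grid.reshape((4, 2)).shape)        # (4, 2): still 8 elements
print(grid.reshape((2, 2, 2)).shape)     # (2, 2, 2): 2 * 2 * 2 = 8

try:
    grid.reshape((3, 3))                 # 9 != 8, so NumPy raises an error
except ValueError as err:
    print("reshape failed:", err)

transposed = grid.T                                     # shape (4, 2)
with_row = np.append(grid, [[9, 10, 11, 12]], axis=0)   # new array, shape (3, 4)
without_col = np.delete(grid, 0, axis=1)                # new array, first column removed
left, right = np.split(grid, 2, axis=1)                 # two halves of shape (2, 2)

print(grid.shape)  # (2, 4): the original array was never modified
```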

### An example of some functions

In [141]:
before = np.array([[1,2,3,4],[5,6,7,8]])
print(before)

after = before.reshape((4,2))
print(after)

[[1 2 3 4]
 [5 6 7 8]]
[[1 2]
 [3 4]
 [5 6]
 [7 8]]


In [142]:
# Vertical Stack
v1 = np.array([1,2,3,4])
v2 = np.array([5,6,7,8])

np.vstack([v1,v2,v1,v2])

array([[1, 2, 3, 4],
       [5, 6, 7, 8],
       [1, 2, 3, 4],
       [5, 6, 7, 8]])

In [143]:
# Horizontal Stack
h1 = np.ones((2,4))
h2 = np.zeros((2,2))

np.hstack((h1,h2))

array([[1., 1., 1., 1., 0., 0.],
       [1., 1., 1., 1., 0., 0.]])

In [144]:
# For more on sorting: https://numpy.org/doc/stable/reference/routines.sort.html 
a = np.array([[19, 1], [15, 94]])
print ("The input array is : \n", a)

# sorting without an axis argument sorts the elements within each row (along the last axis)
a_sorted =  np.sort(a)
print("A sorted array is : \n", a_sorted)

#sorting along the last axis, same as above
a_sorted = np.sort(a, axis = -1)
print ("Sorted array along the last axis : \n", a_sorted)

# sorting along the first axis (axis 0), i.e. down each column
a_sorted = np.sort(a, axis = 0)
print ("Sorted array along the first axis : \n", a_sorted)


# sorting the flattened array
a_sorted = np.sort(a, axis = None)
print ("Sorted array when flattened: \n", a_sorted)

# More on sorting can be found here: https://www.educba.com/numpy-sort/?source=leftnav 

The input array is : 
 [[19  1]
 [15 94]]
A sorted array is : 
 [[ 1 19]
 [15 94]]
Sorted array along the last axis : 
 [[ 1 19]
 [15 94]]
Sorted array along the first axis : 
 [[15  1]
 [19 94]]
Sorted array when flattened: 
 [ 1 15 19 94]


### Mathematics

For more visit: https://docs.scipy.org/doc/numpy/reference/routines.math.html

In [145]:
a = np.array([1,2,3,4])
print(a)

[1 2 3 4]


In [146]:
a + 2

array([3, 4, 5, 6])

In [147]:
a - 2

array([-1,  0,  1,  2])

In [148]:
a * 2

array([2, 4, 6, 8])

In [149]:
a / 2

array([0.5, 1. , 1.5, 2. ])

In [150]:
a ** 2

array([ 1,  4,  9, 16], dtype=int32)

In [151]:
a = np.array([1,2,3,4])
b = np.array([1,0,1,0])

a + b

array([2, 2, 4, 4])

In [152]:
np.cos(a)

array([ 0.54030231, -0.41614684, -0.9899925 , -0.65364362])

### Linear Algebra
For more visit: https://docs.scipy.org/doc/numpy/reference/routines.linalg.html 

In [153]:
# Matrix product of 2 arrays

a = np.ones((2,3))
print(a)

b = np.full((3,2), 2)
print(b)

np.matmul(a,b)

# Note that with 2D arrays, np.dot() gives the same result, but np.matmul (or the @ operator) is preferred
# https://numpy.org/doc/stable/reference/generated/numpy.dot.html 

[[1. 1. 1.]
 [1. 1. 1.]]
[[2 2]
 [2 2]
 [2 2]]


array([[6., 6.],
       [6., 6.]])
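
The comment about np.dot() can be verified with a quick sketch. It rebuilds the same two arrays under illustrative names so it does not depend on the cell above.

```python
import numpy as np

left = np.ones((2, 3))
right = np.full((3, 2), 2)

via_matmul = np.matmul(left, right)
via_operator = left @ right      # operator form of matrix multiplication
via_dot = np.dot(left, right)    # same result when both inputs are 2D

print(np.array_equal(via_matmul, via_operator))  # True
print(np.array_equal(via_matmul, via_dot))       # True
```

The two functions only diverge for arrays with more than two dimensions, which is one reason to stick with np.matmul or @ for matrix products.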

In [154]:
# Find the determinant
# Note that in mathematics, the determinant is a scalar value that is a function of the entries of a square matrix. 
c = np.identity(3)
np.linalg.det(c)

1.0

### Statistics

In [155]:
a = np.array([[1,2,3],[4,6,5]])
print(a)
np.min(a)

[[1 2 3]
 [4 6 5]]


1

In [156]:
np.max(a, axis = 1) # Of each row

array([3, 6])

In [157]:
np.max(a, axis = 0) # Of each column

array([4, 6, 5])

In [158]:
np.sum(a, axis = 1) # Of each row's elements

array([ 6, 15])

In [159]:
np.sum(a, axis = 0) # Of each column's elements

array([5, 8, 8])

### Loading Data from file

In [160]:
filedata = np.genfromtxt('data.txt', delimiter=',')
filedata = filedata.astype('int32')
print(filedata)

[[  1  13  21  11 196  75   4   3  34   6   7   8   0   1   2   3   4   5]
 [  3  42  12  33 766  75   4  55   6   4   3   4   5   6   7   0  11  12]
 [  1  22  33  11 999  11   2   1  78   0   1   2   9   8   7   1  76  88]]


## pandas 

pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool.

It offers:
* A fast and efficient DataFrame object for data manipulation 
* Tools for reading and writing data 
* Intelligent data alignment and integrated handling of missing data
* Flexible reshaping and pivoting of datasets
* Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
* Time series functionality: date range generation and frequency conversion, moving window statistics, date shifting and lagging. You can even create domain-specific time offsets and join time series without losing data.

Documentation: 

https://pandas.pydata.org/docs/user_guide/index.html

Helpful tutorials:

https://pandas.pydata.org/docs/getting_started/tutorials.html


### pandas data structures

To summarize, there are two main pandas data structures: Series and DataFrame.

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. 
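
To make the two structures concrete, here is a minimal sketch with a few invented values (the labels `p1` to `p3` and the data are hypothetical, chosen only for illustration):

```python
import pandas as pd

# A Series: one-dimensional, with an index label for every value.
ages = pd.Series([22.0, 38.0, 26.0], index=["p1", "p2", "p3"], name="Age")
print(ages["p2"])  # 38.0

# A DataFrame built from a dict of columns; each column is itself a Series.
people = pd.DataFrame(
    {
        "Age": [22.0, 38.0, 26.0],
        "Sex": ["male", "female", "female"],
    },
    index=["p1", "p2", "p3"],
)
print(people.dtypes)                          # columns can have different types
print(isinstance(people["Age"], pd.Series))   # True
```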

In this tutorial, we will focus on using the pandas DataFrame with real-world data. You can build your own DataFrame; however, we don't usually create DataFrames by hand.

Instead, we read, explore, manipulate, and visualise data in pandas by importing data into a DataFrame. pandas can read from multiple formats, but the most common one is comma-separated values (CSV).

For a more thorough introduction to the pandas data structures, see:
https://pandas.pydata.org/docs/user_guide/dsintro.html 

### Data for pandas example 

One of the most famous datasets that people use to get into Machine Learning is the Titanic dataset. 

Link to data: https://www.kaggle.com/c/titanic/data?select=train.csv 

The main objective with this dataset is to study which factors affected a passenger's chances of surviving on board the Titanic.

In [161]:
import pandas as pd

In [162]:
df = pd.read_csv('/Users/noura/Desktop/Work/FEPS/2022-2023/P420 Data Science in PoliSci/TA Sessions/train.csv')

# Replace the file path with your own
# This function transforms csv data into a pandas DataFrame

df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [163]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [164]:
df.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [165]:
# Summary of the dataset

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


### Breakdown of the variables: 

Survived:    Survival (0 = No, 1 = Yes)

PassengerId: Unique Id of a passenger

Pclass:    Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)

Sex:    Sex

Age:    Age in years

SibSp:    # of siblings / spouses aboard the Titanic

Parch:    # of parents / children aboard the Titanic

Ticket:    Ticket number

Fare:    Passenger fare

Cabin:    Cabin number

Embarked:    Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

#### Get the unique values for each column

This is especially helpful for quickly pinpointing categorical variables and checking how clean they are.

It also helps check variables whose values must all be unique, such as identifiers.
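
Both checks can be sketched on a tiny invented DataFrame (the frame `toy` and its values are hypothetical and deliberately messy):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Id": [1, 2, 2, 4],                           # should be unique, but 2 repeats
        "Sex": ["male", "Male", "female", "female"],  # inconsistent spelling
    }
)

print(toy.nunique())         # 3 distinct values in each column
print(toy["Sex"].unique())   # reveals the stray "Male" spelling
print(toy["Id"].is_unique)   # False

# Show the rows that share an Id.
print(toy[toy["Id"].duplicated(keep=False)])
```

A categorical column with more distinct values than expected, or an identifier column with fewer distinct values than rows, is a sign that the column needs cleaning.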

In [166]:
df.nunique()

PassengerId    891
Survived         2
Pclass           3
Name           891
Sex              2
Age             88
SibSp            7
Parch            7
Ticket         681
Fare           248
Cabin          147
Embarked         3
dtype: int64

#### Inspecting/Selecting certain columns

In [167]:
# You can inspect a certain column on its own

df['Pclass']

0      3
1      1
2      3
3      1
4      3
      ..
886    2
887    1
888    3
889    1
890    3
Name: Pclass, Length: 891, dtype: int64

In [168]:
# Or multiple columns 

df[['Pclass', 'Parch', 'Sex']]

Unnamed: 0,Pclass,Parch,Sex
0,3,0,male
1,1,0,female
2,3,0,female
3,1,0,female
4,3,0,male
...,...,...,...
886,2,0,male
887,1,0,female
888,3,2,female
889,1,0,male


#### Including/Excluding Certain columns

For example, in this dataset, the Name, Ticket, and PassengerId of a passenger are unlikely to play a role in whether the passenger survived.

To exclude such columns, either select the columns you want to keep or drop the ones you don't need:

In [169]:
df1 = df[['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin', 'Embarked']]
df1

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,0,3,male,22.0,1,0,7.2500,,S
1,1,1,female,38.0,1,0,71.2833,C85,C
2,1,3,female,26.0,0,0,7.9250,,S
3,1,1,female,35.0,1,0,53.1000,C123,S
4,0,3,male,35.0,0,0,8.0500,,S
...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,,S
887,1,1,female,19.0,0,0,30.0000,B42,S
888,0,3,female,,1,2,23.4500,,S
889,1,1,male,26.0,0,0,30.0000,C148,C


In [170]:
df.drop(['Name', 'Ticket'], axis = 1, inplace=True)
df
# axis=1 drops columns (axis=0 would drop rows); inplace=True modifies df itself instead of returning a new DataFrame

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,1,0,3,male,22.0,1,0,7.2500,,S
1,2,1,1,female,38.0,1,0,71.2833,C85,C
2,3,1,3,female,26.0,0,0,7.9250,,S
3,4,1,1,female,35.0,1,0,53.1000,C123,S
4,5,0,3,male,35.0,0,0,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,male,27.0,0,0,13.0000,,S
887,888,1,1,female,19.0,0,0,30.0000,B42,S
888,889,0,3,female,,1,2,23.4500,,S
889,890,1,1,male,26.0,0,0,30.0000,C148,C


#### Inspecting/Selecting certain rows

In [171]:
df.iloc[500:511]

# Remember the end of a slice is exclusive: this returns rows 500 to 510

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
500,501,0,3,male,17.0,0,0,8.6625,,S
501,502,0,3,female,21.0,0,0,7.75,,Q
502,503,0,3,female,,0,0,7.6292,,Q
503,504,0,3,female,37.0,0,0,9.5875,,S
504,505,1,1,female,16.0,0,0,86.5,B79,S
505,506,0,1,male,18.0,1,0,108.9,C65,C
506,507,1,2,female,33.0,0,2,26.0,,S
507,508,1,1,male,,0,0,26.55,,S
508,509,0,3,male,28.0,0,0,22.525,,S
509,510,1,3,male,26.0,0,0,56.4958,,S


#### Conditional Selection 

Conditional selection lets us filter the data, for example to keep only male or only female passengers.

In [172]:
df[df['Sex'] == 'male']

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,1,0,3,male,22.0,1,0,7.2500,,S
4,5,0,3,male,35.0,0,0,8.0500,,S
5,6,0,3,male,,0,0,8.4583,,Q
6,7,0,1,male,54.0,0,0,51.8625,E46,S
7,8,0,3,male,2.0,3,1,21.0750,,S
...,...,...,...,...,...,...,...,...,...,...
883,884,0,2,male,28.0,0,0,10.5000,,S
884,885,0,3,male,25.0,0,0,7.0500,,S
886,887,0,2,male,27.0,0,0,13.0000,,S
889,890,1,1,male,26.0,0,0,30.0000,C148,C


How it works:

The expression df['Sex'] == 'male' returns a Boolean for each row. Wrapping it in df[] returns only the rows where that Boolean is True, i.e. all columns for the male passengers.
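
A minimal sketch of the same mechanism on a small invented DataFrame (the frame `toy` and its values are hypothetical). It also shows two related idioms that are assumptions beyond the cells here: combining conditions with `&`, and selecting rows and columns in one step with `.loc`.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Sex": ["male", "female", "male", "female"],
        "Pclass": [3, 1, 1, 3],
    }
)

mask = toy["Sex"] == "male"   # Boolean Series, one value per row
print(mask.tolist())          # [True, False, True, False]

males = toy[mask]             # keeps only the rows where mask is True
print(males.index.tolist())   # [0, 2]

# Combine conditions with & (and) or | (or); each condition needs parentheses.
first_class_males = toy[(toy["Sex"] == "male") & (toy["Pclass"] == 1)]
print(first_class_males)

# .loc filters rows and picks columns in a single step.
print(toy.loc[mask, ["Pclass"]])
```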

We can combine what we have learned thus far:

In [173]:
# Columns Pclass and Sex, filtered to male passengers, then the rows at positions 500 to 510 of that filtered result
df[['Pclass','Sex']][df['Sex'] == 'male'].iloc[500:511]

Unnamed: 0,Pclass,Sex
773,3,male
775,3,male
776,3,male
778,3,male
782,1,male
783,3,male
784,3,male
785,3,male
787,3,male
788,3,male


#### Aggregation Functions/Statistics

Note that describe() summarizes only the numerical columns by default. Other functions, such as max(), may also process non-numerical columns, so it is often clearer to select the numerical columns explicitly.
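
One way to restrict the statistics to numerical columns without listing them by hand is `select_dtypes`. The sketch below uses a small invented DataFrame (`toy` and its values are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Age": [22.0, 38.0, 26.0],
        "Fare": [7.25, 71.28, 7.93],
        "Sex": ["male", "female", "female"],
    }
)

numeric = toy.select_dtypes(include="number")  # keeps Age and Fare, drops Sex
print(numeric.columns.tolist())                # ['Age', 'Fare']
print(numeric.mean())
print(numeric.corr())
```

Selecting the columns first keeps the result the same across pandas versions, whose defaults for non-numerical columns have changed over time.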

In [174]:
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [175]:
df.max()



PassengerId         891
Survived              1
Pclass                3
Sex                male
Age                80.0
SibSp                 8
Parch                 6
Fare           512.3292
dtype: object

In [176]:
df[['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']].max()

Survived      1.0000
Pclass        3.0000
Age          80.0000
SibSp         8.0000
Parch         6.0000
Fare        512.3292
dtype: float64

In [177]:
df[['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']].min()

Survived    0.00
Pclass      1.00
Age         0.42
SibSp       0.00
Parch       0.00
Fare        0.00
dtype: float64

In [178]:
df[['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']].median()

Survived     0.0000
Pclass       3.0000
Age         28.0000
SibSp        0.0000
Parch        0.0000
Fare        14.4542
dtype: float64

In [179]:
df.count() #How many non-NA values in each column

PassengerId    891
Survived       891
Pclass         891
Sex            891
Age            714
SibSp          891
Parch          891
Fare           891
Cabin          204
Embarked       889
dtype: int64

In [180]:
df[['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']].std()

Survived     0.486592
Pclass       0.836071
Age         14.526497
SibSp        1.102743
Parch        0.806057
Fare        49.693429
dtype: float64

In [181]:
# Get the mean of the relevant variables for rows 500 to 510

df[['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']].iloc[500:511].mean()

Survived     0.454545
Pclass       2.363636
Age         25.000000
SibSp        0.090909
Parch        0.181818
Fare        33.486364
dtype: float64

In [182]:
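# Correlation matrix (on pandas 2.0+, use df.corr(numeric_only=True), since non-numeric columns are no longer dropped automatically)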

df.corr()

# Interpret the survived column/row

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
PassengerId,1.0,-0.005007,-0.035144,0.036847,-0.057527,-0.001652,0.012658
Survived,-0.005007,1.0,-0.338481,-0.077221,-0.035322,0.081629,0.257307
Pclass,-0.035144,-0.338481,1.0,-0.369226,0.083081,0.018443,-0.5495
Age,0.036847,-0.077221,-0.369226,1.0,-0.308247,-0.189119,0.096067
SibSp,-0.057527,-0.035322,0.083081,-0.308247,1.0,0.414838,0.159651
Parch,-0.001652,0.081629,0.018443,-0.189119,0.414838,1.0,0.216225
Fare,0.012658,0.257307,-0.5495,0.096067,0.159651,0.216225,1.0


We can also check how often each value occurs in a column, in table format.

In [183]:
df['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

### Data cleaning

We have done enough exploration to understand our data. Before moving on to any type of analysis (e.g., regression, applying an ML algorithm, or visualization of any kind), we must clean our data.

Cleaning data is challenging, takes a lot of practice, and is a vital step.

Typically, we need to deal with null values, empty values, incorrect timestamps, wrong categorization or inconsistent spelling, and much more. What you do in each case depends on your data and your interests.

#### Determining and dealing with NA values

The df.info() command has already told us there are null values. Let's explore this further.

In [184]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Sex              0
Age            177
SibSp            0
Parch            0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [185]:
df[df['Age'].isnull()]

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
5,6,0,3,male,,0,0,8.4583,,Q
17,18,1,2,male,,0,0,13.0000,,S
19,20,1,3,female,,0,0,7.2250,,C
26,27,0,3,male,,0,0,7.2250,,C
28,29,1,3,female,,0,0,7.8792,,Q
...,...,...,...,...,...,...,...,...,...,...
859,860,0,3,male,,0,0,7.2292,,C
863,864,0,3,female,,8,2,69.5500,,S
868,869,0,3,male,,0,0,9.5000,,S
878,879,0,3,male,,0,0,7.8958,,S


You can drop the NA values, or you can choose a way to fill them. For more on missing (NA) data, visit https://pandas.pydata.org/docs/user_guide/missing_data.html 

For the Age variable, we will take the mean of all ages and replace NA with it. This is a simple and common way to handle null values: it leaves the column mean unchanged, although it reduces the variance and can change the shape of the distribution. A note on outliers: with ML/computational methods, outliers can be telling signs and should not be removed automatically.

We will drop the Cabin column because a very large share of its values is missing (687 of 891). It is better to retain our observations and drop this variable instead.

We will drop rows with missing embarked values.
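
The three decisions above can be sketched on a tiny invented DataFrame (`toy` and its values are hypothetical). One difference from the cells that follow is an assumption of this sketch: `dropna(subset=...)` only checks the named column, which is a little more explicit than a bare `dropna()`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "Age": [22.0, np.nan, 26.0, 40.0],
        "Cabin": [np.nan, "C85", np.nan, np.nan],
        "Embarked": ["S", "C", np.nan, "Q"],
    }
)

cleaned = toy.copy()
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].mean())  # fill with the mean of 22, 26, 40
cleaned = cleaned.drop(columns=["Cabin"])                      # drop the mostly empty column
cleaned = cleaned.dropna(subset=["Embarked"])                  # drop rows missing Embarked

print(cleaned)
print(cleaned.isnull().sum().sum())  # 0
```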

In [186]:
df['Age'] = df['Age'].fillna(df['Age'].mean())

In [187]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Sex              0
Age              0
SibSp            0
Parch            0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [188]:
df.drop(['Cabin'], axis = 1, inplace=True)

In [189]:
df.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Sex            0
Age            0
SibSp          0
Parch          0
Fare           0
Embarked       2
dtype: int64

In [190]:
df.dropna(inplace=True)

In [191]:
df.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Sex            0
Age            0
SibSp          0
Parch          0
Fare           0
Embarked       0
dtype: int64

In [192]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  889 non-null    int64  
 1   Survived     889 non-null    int64  
 2   Pclass       889 non-null    int64  
 3   Sex          889 non-null    object 
 4   Age          889 non-null    float64
 5   SibSp        889 non-null    int64  
 6   Parch        889 non-null    int64  
 7   Fare         889 non-null    float64
 8   Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(2)
memory usage: 69.5+ KB


### GroupBy
GroupBy allows us to split the data into groups and explore each group, for example by computing a statistic per group.
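
A minimal sketch of the idea, using a small invented DataFrame instead of the Titanic file (`toy` and its values are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Sex": ["male", "female", "male", "female"],
        "Survived": [0, 1, 1, 1],
        "Fare": [7.25, 71.25, 8.0, 53.0],
    }
)

# The mean of a 0/1 column per group is the share of 1s in that group.
survival_rate = toy.groupby("Sex")["Survived"].mean()
print(survival_rate)

# Several statistics at once for one column.
fare_summary = toy.groupby("Sex")["Fare"].agg(["mean", "max", "count"])
print(fare_summary)
```

The same pattern should carry over to the Titanic DataFrame, for instance grouping by Sex or Pclass and taking the mean of Survived.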

### Concatenation and Merging

### Data Manipulation
* Data types
* Using functions
* Creating custom columns

### Exporting Data/Saving