![alt text](https://whatsthebigdata.files.wordpress.com/2017/06/data-science-machine-learning-software-2015-2017.jpg)

##**NumPy**


NumPy (Numerical Python) is Python's core library for n-dimensional arrays and numerical computing, including linear algebra. Learning NumPy is important for two reasons:

1.   NumPy is very useful for performing mathematical and logical operations on arrays.
2.   NumPy provides an abundance of useful features for operations on n-dimensional arrays and matrices in Python.
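
As a small illustration of the two points above, the sketch below applies one mathematical and one logical operation to a whole array at once. The array `temps_c` and its values are made up for this example.

```python
import numpy as np

# hypothetical temperatures in Celsius (illustrative values only)
temps_c = np.array([12.0, 18.5, 25.0, 31.5])

# mathematical operation applied to every element at once
temps_f = temps_c * 9 / 5 + 32

# logical operation: True where the temperature is above 20 degrees Celsius
is_warm = temps_c > 20

print(temps_f)
print(is_warm)
```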

NumPy is a foundational library: almost every data science or machine learning package in Python, such as SciPy (Scientific Python), Matplotlib (a plotting library), and scikit-learn, depends on it to some extent.

Documentation - https://numpy.org/doc/

###**Import NumPy and Pandas Libraries**

In [1]:
import pandas as pd
import numpy as np

In [2]:
print ("NumPy version - " + np.version.version)

NumPy version - 1.19.5


###**Creating Arrays**

![arrays](https://dphi.tech/blog/wp-content/uploads/2021/03/idn2d-array-1024x542.png)

####Creating 1D *NumPy Arrays*

In [3]:
#create 1D numpy array from a list
array_1d = np.array([1, 2, 3, 4, 5])

print (array_1d)

[1 2 3 4 5]


In [4]:

#create 1D numpy array from a list variable
list_1 = [11, 22, 33, 44, 55]
array_l1 = np.array(list_1)

print(array_l1)

[11 22 33 44 55]


In [5]:
#print the type of the NumPy array we created
print ("Array type = " + str(type(array_1d)))

#print the data type 
print ("Data type = " + str(array_1d.dtype))

Array type = <class 'numpy.ndarray'>
Data type = int64


In [6]:
#check the shape of the ndarray we created, it should have only 1 dimension
print(array_1d.shape)

(5,)


In [7]:
type(array_1d)

numpy.ndarray

In [8]:
#Demonstrating the Range function.
plain_range = range(4,12)
list(plain_range)

[4, 5, 6, 7, 8, 9, 10, 11]

Notice that the stop value (the second argument) is excluded: the range ends at 11, not 12.

In [9]:
interval_range = range(1, 15, 2)

In [10]:
list(interval_range)

[1, 3, 5, 7, 9, 11, 13]
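
NumPy's `np.arange` follows the same start, stop, step convention as `range`, but returns an ndarray directly. A minimal sketch; the variable names are chosen for illustration only.

```python
import numpy as np

# np.arange uses the same start/stop/step convention as range
arange_plain = np.arange(4, 12)      # the stop value 12 is excluded
arange_step = np.arange(1, 15, 2)    # every second number from 1 up to (not including) 15

print(arange_plain)
print(arange_step)
```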

### Self Practice!

1. Create your own Python List using the range function.
2. Convert the list into a NumPy array
3. Check the data type to make sure it is a NumPy array
4. Convert that array back into a list



In [11]:
list(range(15, 1, -2))

[15, 13, 11, 9, 7, 5, 3]

### Convert a NumPy Array into a Pandas DataFrame

In [13]:
#Create a matrix (2D array)
numpy_array = np.array([[4,2,9],[32,43,68],[73,8,9]])

In [14]:
df_from_array = pd.DataFrame(numpy_array)
df_from_array

    0   1   2
0   4   2   9
1  32  43  68
2  73   8   9
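
The conversion also accepts column labels, and `to_numpy()` goes back the other way. The sketch below assumes hypothetical column names `a`, `b`, and `c`, which are not part of the original example.

```python
import numpy as np
import pandas as pd

numpy_array = np.array([[4, 2, 9], [32, 43, 68], [73, 8, 9]])

# the column labels below are hypothetical, chosen only for illustration
df_named = pd.DataFrame(numpy_array, columns=['a', 'b', 'c'])

# to_numpy() converts the DataFrame back into an ndarray
round_trip = df_named.to_numpy()

print(df_named)
print(type(round_trip))
```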


### Self Practice!

1. Create your own Python matrix (a list of lists)
2. Convert the matrix into a NumPy array
3. Check the data type to make sure it is a NumPy array
4. Convert the NumPy array to a Pandas DataFrame

### Python Data Types

In [15]:
type(4)

int

In [16]:
type(4.0)

float

In [17]:
type('4353')

str

In [18]:
type('Hello')

str
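
NumPy arrays store a single element type, reported by `dtype`, so a list that mixes Python types is converted to one common type. The following is a short sketch with illustrative variable names.

```python
import numpy as np

ints = np.array([4, 5])                  # all Python ints
mixed = np.array([4, 4.0])               # int and float are promoted to float
strings = np.array(['4353', 'Hello'])    # Python strings become a Unicode dtype

print(ints.dtype)      # an integer dtype (the exact width depends on the platform)
print(mixed.dtype)
print(strings.dtype)
```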

###**Indexing, Slice indexing**

We can use slice indexing to pull out sub-regions of ndarrays.

More documentation can be found at https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html
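
Before the cells below, here is a compact sketch of the slice forms they rely on, plus negative indices and a step. The array `sample` is just an illustrative stand-in.

```python
import numpy as np

sample = np.array([1, 2, 3, 4, 5, 6, 7])

first_three = sample[:3]     # indices 0, 1, 2
last_two = sample[-2:]       # negative indices count from the end
every_other = sample[::2]    # start:stop:step with a step of 2

print(first_three, last_two, every_other)
```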

In [19]:
#create 1D array
array_1d = np.array([1, 2, 3, 4, 5, 6, 7])
print (array_1d)

[1 2 3 4 5 6 7]


In [20]:
#because we have 1D array, we need only one index to access element at any position
#call the value at index 2
print ('element at index 2: ', array_1d[2])

#ndarrays are mutable, here we change an element at index 2 
array_1d[2] = 10
print ('element at index 2 after change: ', array_1d[2])

element at index 2:  3
element at index 2 after change:  10


In [21]:
print (array_1d)

[ 1  2 10  4  5  6  7]


In [22]:
#get the values in the range
print ('elements in the range: ', array_1d[1:3]) #a:b - including a, until (and excluding) b

#we can change values in the range as well
array_1d[1:3] = 20
print ('elements in the range after change: ', array_1d[1:3])

elements in the range:  [ 2 10]
elements in the range after change:  [20 20]


In [23]:
# Nested list
l = [[1,2,3],[4,5,6],[6,7,8]]
print(l)

[[1, 2, 3], [4, 5, 6], [6, 7, 8]]


In [24]:
#for 2D arrays we need two indices: the first one for the row and the second one for the column
#create 2D array
array_2d = np.array([[11, 12, 13, 14, 15, 16, 17], [21, 22, 23, 24, 25, 26, 27], [31, 32, 33, 34, 35, 36, 37]])
print ('original array: \n', array_2d, '\n')

#slicing: generates an array of the same rank
array_copy = array_2d.copy()
slice_row = array_copy[1:2, :] #a:b - including a, until (and excluding) b
print ('sliced array on [1:2, :]: \n', slice_row)

original array: 
 [[11 12 13 14 15 16 17]
 [21 22 23 24 25 26 27]
 [31 32 33 34 35 36 37]] 

sliced array on [1:2, :]: 
 [[21 22 23 24 25 26 27]]


In [25]:
#slice_row is a view of array_copy, so this assignment changes array_copy;
#array_2d stays unchanged because array_copy was created with .copy()
slice_row[:, :] = 10

print ('sliced array: \n', slice_row, '\n')
print ('original array: \n', array_2d)


sliced array: 
 [[10 10 10 10 10 10 10]] 

original array: 
 [[11 12 13 14 15 16 17]
 [21 22 23 24 25 26 27]
 [31 32 33 34 35 36 37]]
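
Whether a change to a slice reaches the array it was taken from depends on whether the slice is a view or a copy. The sketch below contrasts the two on a small, made-up array called `base` and uses `np.shares_memory` to check.

```python
import numpy as np

base = np.array([[11, 12, 13], [21, 22, 23]])

view_row = base[1:2, :]           # basic slicing returns a view
copy_row = base[1:2, :].copy()    # .copy() allocates independent data

view_row[:, :] = 10   # writes through to base
copy_row[:, :] = 99   # leaves base untouched

print(base)
print(np.shares_memory(base, view_row), np.shares_memory(base, copy_row))
```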


In [26]:
#we can do the slicing for columns as well
slice_col = array_2d[:, 1:5]
print (slice_col)

[[12 13 14 15]
 [22 23 24 25]
 [32 33 34 35]]


In [27]:
#we can use filters to select just those elements which meet certain criteria 
#select the elements that are greater than 10
slice_col[slice_col>10]

array([12, 13, 14, 15, 22, 23, 24, 25, 32, 33, 34, 35])

In [28]:
#use similar logical filter to change elements in the array
#add 1 to all the odd values
slice_col[slice_col % 2 == 1] += 1
slice_col

array([[12, 14, 14, 16],
       [22, 24, 24, 26],
       [32, 34, 34, 36]])
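
Because `slice_col` is a view, the in-place update above also changes `array_2d`. If the source array should stay intact, one option is `np.where`, which builds a new array from a boolean mask. A minimal sketch on a made-up array `values`:

```python
import numpy as np

values = np.array([[12, 13, 14], [22, 23, 24]])

mask = values % 2 == 1                         # boolean array, True for odd entries
bumped = np.where(mask, values + 1, values)    # new array; values itself is not modified

print(mask)
print(bumped)
```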

###**Arithmetic array operations**

More documentation at https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.math.html

In [29]:
#addition can be done with the plus sign or 'add' numpy function
arr_x = np.array([[10, 20, 30], [40, 50, 60]])
arr_y = np.array([[11, 21, 31], [41, 51, 61]])

print("Array X")
print(arr_x, '\n')

print("Array Y")
print(arr_y, '\n')

print("Direct Addition")
print (arr_x + arr_y, '\n')

print("Numpy Addition")
print (np.add(arr_x, arr_y))

Array X
[[10 20 30]
 [40 50 60]] 

Array Y
[[11 21 31]
 [41 51 61]] 

Direct Addition
[[ 21  41  61]
 [ 81 101 121]] 

Numpy Addition
[[ 21  41  61]
 [ 81 101 121]]


In [30]:
#subtraction works the same way
print("Direct Subtraction")
print (arr_y - arr_x, '\n')

print("Numpy Subtraction")
print (np.subtract(arr_y, arr_x))

Direct Subtraction
[[1 1 1]
 [1 1 1]] 

Numpy Subtraction
[[1 1 1]
 [1 1 1]]


In [31]:
#multiplication
print("Direct Multiplication")
print (arr_x * arr_y, '\n')

print("Numpy Multiplication")
print (np.multiply(arr_x, arr_y))

Direct Multiplication
[[ 110  420  930]
 [1640 2550 3660]] 

Numpy Multiplication
[[ 110  420  930]
 [1640 2550 3660]]


In [32]:
#division
print("Direct Division")
print (arr_y / arr_x, '\n')

print("Numpy Division")
print (np.divide(arr_y, arr_x))

Direct Division
[[1.1        1.05       1.03333333]
 [1.025      1.02       1.01666667]] 

Numpy Division
[[1.1        1.05       1.03333333]
 [1.025      1.02       1.01666667]]


In [33]:
#square root

print (np.sqrt(arr_x))

[[3.16227766 4.47213595 5.47722558]
 [6.32455532 7.07106781 7.74596669]]


In [34]:
#exponent (e**x)

print (np.exp(arr_x))

[[2.20264658e+04 4.85165195e+08 1.06864746e+13]
 [2.35385267e+17 5.18470553e+21 1.14200739e+26]]


In [35]:
#power

print (np.power(arr_x, 3))

[[  1000   8000  27000]
 [ 64000 125000 216000]]
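
The operands do not need to have identical shapes: NumPy broadcasts a scalar or a matching 1D row across the larger array. The sketch below reuses `arr_x` from above; the other names are illustrative.

```python
import numpy as np

arr_x = np.array([[10, 20, 30], [40, 50, 60]])

scaled = arr_x * 2                        # the scalar is applied to every element
shifted = arr_x + np.array([1, 2, 3])     # the 1D row is broadcast across both rows

print(scaled)
print(shifted)
```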


##**Pandas**
Pandas is an open source Python library for data analysis that is built on top of NumPy. Why learning Pandas is important:

1.   Pandas allows you to do fast analysis as well as data cleaning and preparation.
2.   Pandas works well with data from a wide variety of sources, such as Excel sheets, CSV files, SQL databases, or even web pages.

Documentation - https://pandas.pydata.org/pandas-docs/version/0.25.3/

###**Pandas Series**

A Pandas Series is a one-dimensional labeled array capable of holding data of any type.

In [36]:
#creating a pandas Series from a list
ls = ['a', 'b', 'c', 'd', 'e']
ser_1 = pd.Series(ls)

print (ser_1)

0    a
1    b
2    c
3    d
4    e
dtype: object


In [37]:
#creating a pandas Series from an array
arr = np.array([10, 20, 30, 40, 50])
ser_2 = pd.Series(arr)

print (ser_2)

0    10
1    20
2    30
3    40
4    50
dtype: int64


In [38]:
#create series with specific indexing
ser_3 = pd.Series(arr, index = ['sarah', 'bob', 'alex', 'den', 'nancy'])

print (ser_3)

sarah    10
bob      20
alex     30
den      40
nancy    50
dtype: int64


In [39]:
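#accessing elements by position with iloc
#(plain integer keys such as ser_3[3] on a labeled Series are deprecated in recent pandas versions)
print (ser_3.iloc[3], '\n\n')
print (ser_3.iloc[[0, 2, 4]], '\n\n')
print (ser_3.iloc[:3])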

40 


sarah    10
alex     30
nancy    50
dtype: int64 


sarah    10
bob      20
alex     30
dtype: int64
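
Since `ser_3` has string labels, elements can also be looked up by label. The sketch below rebuilds the same Series and shows label-based and explicitly positional access side by side; the result variable names are illustrative.

```python
import numpy as np
import pandas as pd

arr = np.array([10, 20, 30, 40, 50])
ser_3 = pd.Series(arr, index=['sarah', 'bob', 'alex', 'den', 'nancy'])

by_label = ser_3['den']                       # label lookup
by_labels = ser_3.loc[['sarah', 'nancy']]     # several labels at once
by_position = ser_3.iloc[3]                   # explicit positional lookup

print(by_label, by_position)
print(by_labels)
```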


### Create a Pandas DataFrame

Pandas DataFrame is a two-dimensional labeled data structure.

In [40]:
#Creating from a list of lists
pet_info = [['Blain', 10, 'Dog'], ['Lucy', 4 , 'Cat'], ['Cinco', 8, 'Rabbit']]
pet_info

[['Blain', 10, 'Dog'], ['Lucy', 4, 'Cat'], ['Cinco', 8, 'Rabbit']]

In [41]:
pet_df = pd.DataFrame(pet_info, columns = ['Name', 'Age', 'Type'],
                      index = ['A', 'B', 'C'])
                     
pet_df

    Name  Age    Type
A  Blain   10     Dog
B   Lucy    4     Cat
C  Cinco    8  Rabbit
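
A list of lists is one way to build a DataFrame. Another common way is a dictionary that maps each column name to its values. This sketch should produce the same table as above; `pet_columns` and `pet_df_from_dict` are names introduced here for illustration.

```python
import pandas as pd

# each key becomes a column, each list holds that column's values
pet_columns = {
    'Name': ['Blain', 'Lucy', 'Cinco'],
    'Age': [10, 4, 8],
    'Type': ['Dog', 'Cat', 'Rabbit'],
}
pet_df_from_dict = pd.DataFrame(pet_columns, index=['A', 'B', 'C'])

print(pet_df_from_dict)
```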


In [42]:
pet_df['Name']

A    Blain
B     Lucy
C    Cinco
Name: Name, dtype: object

### Self Practice!

Print out the 'Type' Column

In [43]:
type(pet_df)

pandas.core.frame.DataFrame

In [44]:
pet_df.dtypes

Name    object
Age      int64
Type    object
dtype: object

### Self Practice!



In [45]:
#Create a list of lists that contain [[Month, Month #(Dec=12, etc.), Season]]
#Create a Pandas DataFrame with the columns = ['Month', 'Int', 'Season']

### iloc vs. loc

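iloc - position-based indexing (positions always start at 0)

loc - label-based indexing

If you don't specify an index, Pandas assigns the default labels 0, 1, 2, ..., so loc[i] and iloc[i] select the same row. Slices still differ: iloc excludes the end position, while loc includes the end label.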

In [46]:
pet_df.iloc[0]

Name    Blain
Age        10
Type      Dog
Name: A, dtype: object

In [47]:
pet_df.loc['B']

Name    Lucy
Age        4
Type     Cat
Name: B, dtype: object
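
The difference between the two accessors shows up most clearly with slices. The sketch below rebuilds `pet_df` and compares them, first with the custom labels and then with a default integer index; `default_df` and the result names are illustrative.

```python
import pandas as pd

pet_df = pd.DataFrame(
    [['Blain', 10, 'Dog'], ['Lucy', 4, 'Cat'], ['Cinco', 8, 'Rabbit']],
    columns=['Name', 'Age', 'Type'],
    index=['A', 'B', 'C'],
)

by_position = pet_df.iloc[0:2]           # positions 0 and 1; the end position is excluded
by_label = pet_df.loc['A':'B']           # labels 'A' through 'B'; the end label is included
single_value = pet_df.loc['C', 'Age']    # row label first, then column label

default_df = pet_df.reset_index(drop=True)   # labels are now 0, 1, 2
pos_slice = default_df.iloc[0:2]             # 2 rows
label_slice = default_df.loc[0:2]            # 3 rows, because label 2 is included

print(len(pos_slice), len(label_slice))
```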

### Apply Concepts to a Data Set

In [48]:
from sklearn.datasets import load_iris
iris = load_iris()
data = iris.data
column_names = iris.feature_names

In [49]:
data = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/iris.csv')

In [50]:
#columns
data.columns

Index(['sepal.length', 'sepal.width', 'petal.length', 'petal.width',
       'variety'],
      dtype='object')

In [51]:
#rows
len(data)

150

### Slicing Arrays, Lists, and Tuples

Slice indexes always start at 0, which means the 1st element is [0] and the 2nd element is [1].
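
The same indexing rules apply to plain Python lists and tuples. A minimal sketch with made-up values:

```python
letters = ['a', 'b', 'c', 'd', 'e']    # list
numbers = (10, 20, 30, 40, 50)         # tuple

first_letter = letters[0]     # the 1st element sits at index 0
second_number = numbers[1]    # the 2nd element sits at index 1
middle = letters[1:4]         # indices 1, 2, 3; the stop index 4 is excluded
evens = numbers[::2]          # every second element of the tuple

print(first_letter, second_number, middle, evens)
```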

In [52]:
data[:10]

   sepal.length  sepal.width  petal.length  petal.width variety
0           5.1          3.5           1.4          0.2  Setosa
1           4.9          3.0           1.4          0.2  Setosa
2           4.7          3.2           1.3          0.2  Setosa
3           4.6          3.1           1.5          0.2  Setosa
4           5.0          3.6           1.4          0.2  Setosa
5           5.4          3.9           1.7          0.4  Setosa
6           4.6          3.4           1.4          0.3  Setosa
7           5.0          3.4           1.5          0.2  Setosa
8           4.4          2.9           1.4          0.2  Setosa
9           4.9          3.1           1.5          0.1  Setosa


In [53]:
data[2:5]

   sepal.length  sepal.width  petal.length  petal.width variety
2           4.7          3.2           1.3          0.2  Setosa
3           4.6          3.1           1.5          0.2  Setosa
4           5.0          3.6           1.4          0.2  Setosa


In [54]:
data[0:10:2] #intervals

   sepal.length  sepal.width  petal.length  petal.width variety
0           5.1          3.5           1.4          0.2  Setosa
2           4.7          3.2           1.3          0.2  Setosa
4           5.0          3.6           1.4          0.2  Setosa
6           4.6          3.4           1.4          0.3  Setosa
8           4.4          2.9           1.4          0.2  Setosa


In [55]:
data.head(10)

   sepal.length  sepal.width  petal.length  petal.width variety
0           5.1          3.5           1.4          0.2  Setosa
1           4.9          3.0           1.4          0.2  Setosa
2           4.7          3.2           1.3          0.2  Setosa
3           4.6          3.1           1.5          0.2  Setosa
4           5.0          3.6           1.4          0.2  Setosa
5           5.4          3.9           1.7          0.4  Setosa
6           4.6          3.4           1.4          0.3  Setosa
7           5.0          3.4           1.5          0.2  Setosa
8           4.4          2.9           1.4          0.2  Setosa
9           4.9          3.1           1.5          0.1  Setosa


In [56]:
data.tail()

     sepal.length  sepal.width  petal.length  petal.width    variety
145           6.7          3.0           5.2          2.3  Virginica
146           6.3          2.5           5.0          1.9  Virginica
147           6.5          3.0           5.2          2.0  Virginica
148           6.2          3.4           5.4          2.3  Virginica
149           5.9          3.0           5.1          1.8  Virginica


In [57]:
data.describe()

       sepal.length  sepal.width  petal.length  petal.width
count         150.0        150.0         150.0        150.0
mean       5.843333     3.057333         3.758     1.199333
std        0.828066     0.435866      1.765298     0.762238
min             4.3          2.0           1.0          0.1
25%             5.1          2.8           1.6          0.3
50%             5.8          3.0          4.35          1.3
75%             6.4          3.3           5.1          1.8
max             7.9          4.4           6.9          2.5


In [58]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal.length  150 non-null    float64
 1   sepal.width   150 non-null    float64
 2   petal.length  150 non-null    float64
 3   petal.width   150 non-null    float64
 4   variety       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [59]:
data.loc[140:]

     sepal.length  sepal.width  petal.length  petal.width    variety
140           6.7          3.1           5.6          2.4  Virginica
141           6.9          3.1           5.1          2.3  Virginica
142           5.8          2.7           5.1          1.9  Virginica
143           6.8          3.2           5.9          2.3  Virginica
144           6.7          3.3           5.7          2.5  Virginica
145           6.7          3.0           5.2          2.3  Virginica
146           6.3          2.5           5.0          1.9  Virginica
147           6.5          3.0           5.2          2.0  Virginica
148           6.2          3.4           5.4          2.3  Virginica
149           5.9          3.0           5.1          1.8  Virginica


In [60]:
data.mode()

   sepal.length  sepal.width  petal.length  petal.width     variety
0           5.0          3.0           1.4          0.2      Setosa
1           NaN          NaN           1.5          NaN  Versicolor
2           NaN          NaN           NaN          NaN   Virginica


In [61]:
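#numeric_only=True skips the text column 'variety' (required in pandas 2.0 and later)
data.mean(numeric_only=True)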

sepal.length    5.843333
sepal.width     3.057333
petal.length    3.758000
petal.width     1.199333
dtype: float64

In [62]:
data['sepal.length'].mean()

5.843333333333335

### Self Practice!

Find the mean petal width

In [63]:
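#numeric_only=True skips the text column 'variety' (required in pandas 2.0 and later)
data.groupby(['sepal.width']).mean(numeric_only=True)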

             sepal.length  petal.length  petal.width
sepal.width
2.0                   5.0           3.5          1.0
2.2              6.066667           4.5     1.333333
2.3                 5.325          3.25        0.975
2.4                   5.3           3.6     1.033333
2.5                5.7625        4.5125         1.55
2.6                  6.16          4.88         1.42
2.7              5.855556      4.622222     1.555556
2.8              6.335714      5.042857     1.707143
2.9                  6.06          4.35         1.32
3.0              6.015385      4.234615     1.403846
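
Grouping is often easier to follow on a categorical column with only a few distinct values. The sketch below uses a small made-up table, `toy`, rather than the iris data: rows are split by the value in `group`, and the mean of `value` is computed within each group.

```python
import pandas as pd

# small made-up table, not taken from the iris data
toy = pd.DataFrame({
    'group': ['x', 'x', 'y', 'y'],
    'value': [1.0, 3.0, 10.0, 20.0],
})

# one mean per distinct entry in the 'group' column
group_means = toy.groupby('group')['value'].mean()

print(group_means)
```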


### Self Practice! 

In [64]:
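#note: load_boston was removed in scikit-learn 1.2, so this cell needs an older scikit-learn version
from sklearn.datasets import load_boston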
boston = load_boston()
data = boston.data
column_names = boston.feature_names

In [65]:
data

array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 1.5300e+01, 3.9690e+02,
        4.9800e+00],
       [2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9690e+02,
        9.1400e+00],
       [2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9283e+02,
        4.0300e+00],
       ...,
       [6.0760e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
        5.6400e+00],
       [1.0959e-01, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9345e+02,
        6.4800e+00],
       [4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
        7.8800e+00]])

In [66]:
df = pd.DataFrame(data, columns = column_names)
df

        CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT
0    0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98
1    0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14
2    0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03
3    0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94
4    0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33
...      ...   ...    ...   ...    ...    ...   ...     ...  ...    ...      ...     ...    ...
501  0.06263   0.0  11.93   0.0  0.573  6.593  69.1  2.4786  1.0  273.0     21.0  391.99   9.67
502  0.04527   0.0  11.93   0.0  0.573  6.120  76.7  2.2875  1.0  273.0     21.0  396.90   9.08
503  0.06076   0.0  11.93   0.0  0.573  6.976  91.0  2.1675  1.0  273.0     21.0  396.90   5.64
504  0.10959   0.0  11.93   0.0  0.573  6.794  89.3  2.3889  1.0  273.0     21.0  393.45   6.48


## Thank you