# **Introduction to Machine Learning!**

## NumPy

In the previous notebook we learnt the basics of Python and why it is so popular for AI applications. We also covered the data types available in Python, what modules and packages are, and the different control-flow statements.

Now let's turn to **NumPy** (short for **Num**erical **Py**thon), one of the most popular Python packages for scientific calculations, especially those involving arrays. 

Let's get started! 

## Arrays

NumPy arrays in Python are grids of values that all share the same data type. 

Each element of a NumPy array has a non-negative index associated with it, starting at 0. Negative indices are also accepted and count backwards from the end of the array. 

When dealing with arrays, we come across a useful term called **shape**, which gives the size of the array along each dimension. 

<u>**So why do we use NumPy array instead of standard Python lists?**</u><br> 
    
- In data science calculations, ***NumPy arrays use less memory and run faster than Python lists***. This matters because we commonly deal with large amounts of data, where the small per-element difference between lists and arrays adds up to a significant one. 
- NumPy also provides multi-dimensional array and matrix data structures, along with many mathematical routines that operate on arrays, such as trigonometric, statistical, and algebraic functions. 
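
The memory claim above can be checked with a rough sketch like the one below. The size `n` and the variable names are arbitrary choices for illustration, and the byte count for the list is only an estimate (the list's pointer table plus the integer objects it refers to).

```python
import sys
import numpy as np

n = 100_000  # illustrative size, chosen arbitrarily

py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# A list stores pointers to separate int objects, so count both parts.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(v) for v in py_list)
# An array stores the raw 8-byte integers in one contiguous buffer.
array_bytes = np_array.nbytes

print("List bytes (estimate):", list_bytes)
print("Array bytes          :", array_bytes)

# The same element-wise operation, written both ways.
doubled_list = [v * 2 for v in py_list]  # explicit Python-level loop
doubled_array = np_array * 2             # vectorised, runs in compiled code
```

Runtime differences depend on the machine, so the sketch does not time anything; in a notebook, the `%timeit` magic is a convenient way to compare the last two lines.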

# Class Examples

In [3]:
import numpy as np
x = np.array([-2, -4, 7, 10, 12])
print ("Sum :", x.sum())
print ("Product :", x.prod())
print ("Min :", x.min())
print ("Max :", x.max())
print ("Max element is at position ", x.argmax())

Sum : 23
Product : 6720
Min : -4
Max : 12
Max element is at position  4


In [4]:
dir(np)

['ALLOW_THREADS',
 'AxisError',
 'BUFSIZE',
 'Bytes0',
 'CLIP',
 'DataSource',
 'Datetime64',
 'ERR_CALL',
 'ERR_DEFAULT',
 'ERR_IGNORE',
 'ERR_LOG',
 'ERR_PRINT',
 'ERR_RAISE',
 'ERR_WARN',
 'FLOATING_POINT_SUPPORT',
 'FPE_DIVIDEBYZERO',
 'FPE_INVALID',
 'FPE_OVERFLOW',
 'FPE_UNDERFLOW',
 'False_',
 'Inf',
 'Infinity',
 'MAXDIMS',
 'MAY_SHARE_BOUNDS',
 'MAY_SHARE_EXACT',
 'MachAr',
 'NAN',
 'NINF',
 'NZERO',
 'NaN',
 'PINF',
 'PZERO',
 'RAISE',
 'SHIFT_DIVIDEBYZERO',
 'SHIFT_INVALID',
 'SHIFT_OVERFLOW',
 'SHIFT_UNDERFLOW',
 'ScalarType',
 'Str0',
 'Tester',
 'TooHardError',
 'True_',
 'UFUNC_BUFSIZE_DEFAULT',
 'UFUNC_PYVALS_NAME',
 'Uint64',
 'WRAP',
 '_NoValue',
 '_UFUNC_API',
 '__NUMPY_SETUP__',
 '__all__',
 '__builtins__',
 '__cached__',
 '__config__',
 '__deprecated_attrs__',
 '__dir__',
 '__doc__',
 '__expired_functions__',
 '__file__',
 '__getattr__',
 '__git_revision__',
 '__loader__',
 '__mkl_version__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 ...]

In [6]:
ndarray = np.arange(10)
print (ndarray)

[0 1 2 3 4 5 6 7 8 9]


In [7]:
ndarray.reshape((5, 2))

array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])

In [8]:
ndarray1 = ndarray.reshape((5, 2))
ndarray1.reshape((2, 5))

array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])

In [9]:
ndarray1.flatten()

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [10]:
ndarray1 = np.array([[1, 2, 3], [4, 5, 6]])
ndarray2 = np.array([[7, 8, 9], [10, 11, 12]])

In [11]:
np.vstack((ndarray1, ndarray2))

array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])

In [12]:
np.hstack((ndarray1, ndarray2))

array([[ 1,  2,  3,  7,  8,  9],
       [ 4,  5,  6, 10, 11, 12]])

In [13]:
np.split(np.vstack((ndarray1, ndarray2)), 2)

[array([[1, 2, 3],
        [4, 5, 6]]),
 array([[ 7,  8,  9],
        [10, 11, 12]])]

In [14]:
result = np.random.randint(1, 7, size=10)
print(result)

[2 3 4 1 3 3 3 1 4 2]


In [15]:
outcome = np.random.randint(1, 7, size=(3, 3))
print(outcome)

[[5 6 6]
 [3 6 2]
 [1 6 6]]


In [17]:
x = [10, 11, 12, 34]
np.random.shuffle(x)
print (x)

[11, 10, 34, 12]


# Let's start with Lab Work

In [18]:
import numpy as np #importing numpy package 
a = np.array([1,2,3]) #initializing/creating an array(ndarray) with the elements given in the brackets
print (a) #printing the array(ndarray)

[1 2 3]


**Let's check the datatype for the array 'a'**

In [19]:
print (type(a)) #write your code for checking the datatype here

<class 'numpy.ndarray'>


In [20]:
#Creating an array with more than one dimension
b = np.array([[1, 2], [3, 4]]) 
print (b)

[[1 2]
 [3 4]]


In [21]:
#let's check the shape of the arrays created so far
print("Shape of array 'a' is:",(a.shape))
print("Shape of array 'b' is:",(b.shape))

Shape of array 'a' is: (3,)
Shape of array 'b' is: (2, 2)


In [22]:
#accessing elements of the array using the index
print(a[0], a[1], a[2]) 
print(b[0][0]) #printing the element in the first row, first column

1 2 3
1


In [23]:
#another way of printing a particular element of an array
print(b[0,1])

2


In [24]:
#using the arange function to create an array
c=np.arange(4) #creates four integers, starting from 0 by default
print (c)
d=np.arange(1,7)#specifying the start point (included) and the end point (excluded)
print (d)

[0 1 2 3]
[1 2 3 4 5 6]


In [25]:
# Create the following array with given shape (3, 4). Name the array 'new'
# [[ 1  2  3  4]
#  [ 5  6  7  8]
#  [ 9 10 11 12]]
new = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
print (new)

[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]


In [26]:
# Slicing is being used to pull out the subarray consisting of the first 2 rows
# and columns 1 and 2; new_sub is the following array of shape (2, 2):

new_sub = new[:2, 1:3]
print(new_sub)

[[2 3]
 [6 7]]


### Integer array indexing 

When you index into numpy arrays using slicing, the resulting array will always be a subarray of the original array. In contrast, integer array indexing allows you to construct arbitrary arrays using the data from another array. 
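
One practical consequence of this difference is that a slice shares memory with the original array, while integer array indexing produces an independent copy. The sketch below illustrates this; the array `grid` and the other variable names are made up for the example.

```python
import numpy as np

grid = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 10, 11, 12]])

# Basic slicing returns a view that shares memory with `grid`.
view = grid[:2, 1:3]
view[0, 0] = 99      # this also changes grid[0, 1]

# Integer array indexing returns a new array (a copy).
picked = grid[[0, 1, 2], [0, 1, 0]]
picked[0] = -1       # grid[0, 0] is left untouched

print(grid[0, 1])                      # 99
print(grid[0, 0])                      # 1
print(np.shares_memory(grid, view))    # True
print(np.shares_memory(grid, picked))  # False
```

If an independent subarray is needed from a slice, calling `.copy()` on it avoids accidental changes to the original.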

Do you recall slicing? We have gone through this in the previous notebook! 

Remember, indices in Python always start from 0 for the first element.

Here is an example:

In [27]:
a = np.array([[1,2], [3, 4], [5, 6]])

#Can you see how you would pick out the values [1 4 5] from the array above? 

#The value '1' is at position 0,0, which is the first 'box' and the first 'item' in the 'box'

#The value '4' is at position 1,1, which is the second 'box' and the second 'item' in the 'box'

#Can you now figure out how you would get the value '5'?

print(np.array([a[0, 0], a[1, 1], a[2, 0]]))  # Prints "[1 4 5]"


#The above can also be written as:
# a[0,0] '1'
# a[1,1] '4'
# a[2,0] '5'
# This is an example of integer array indexing. Look closely: can you tell the difference between this statement and the last?
b = a[[0, 1, 2], [0, 1, 0]] # The first list holds the row indices, the second the column indices


print(b)  # Prints "[1 4 5]"
print(b.shape) # Shape is (3,)

[1 4 5]
[1 4 5]
(3,)


In [28]:
# The above example of integer array indexing is equivalent to this:
print(np.array([a[0, 0], a[1, 1], a[2, 0]]))  # Prints "[1 4 5]"

# When using integer array indexing, you can reuse the same
# element from the source array:
print(a[[0, 0], [1, 1]])  # Prints "[2 2]"

# Equivalent to the previous integer array indexing example
print(np.array([a[0, 1], a[0, 1]]))  # Prints "[2 2]"

[1 4 5]
[2 2]
[2 2]


We can also use integer array indexing to mutate specific elements of an array.

In [29]:
# Create an array of indices
b = np.array([1, 0, 1])
print(b)

# Select one element from each row of a using the indices in b
print(a[np.arange(3), b])

[1 0 1]
[2 3 6]


In [30]:
# Mutate one element from each row of a using the indices in b
a[np.arange(3), b] += 10

print(a)

[[ 1 12]
 [13  4]
 [ 5 16]]


## Datatypes in NumPy Arrays

In [31]:
roll_no = np.array([23, 46])   # Let numpy choose the datatype
print(roll_no.dtype)         # Prints "int64" on most platforms (the default integer type is platform-dependent)

marks = np.array([65.5, 88.0])   # Let numpy choose the datatype
print(marks.dtype)             # Prints "float64"

int64
float64


In [32]:
x = np.array([1, 2], dtype=np.int64)   # Force a particular datatype
print(x.dtype)                         # Prints "int64"

int64


**Array Mathematics**

Basic mathematical operations are applied to arrays elementwise. They are available both as operators (+, -, * etc.) and as built-in functions within the package itself (np.add, np.subtract, np.multiply etc.).

In [33]:
x = np.array([[1,2],[3,4]], dtype=np.float64)
y = np.array([[5,6],[7,8]], dtype=np.float64)

# Elementwise sum; both produce the array
# [[ 6.0  8.0]
#  [10.0 12.0]]
print(x + y)
print(np.add(x, y))

[[ 6.  8.]
 [10. 12.]]
[[ 6.  8.]
 [10. 12.]]


In [34]:
#Write code to create two arrays X and Y with the following elements and print their difference
# X = [[13.0 7.0]
#     [4.0 47.0]]
# Y = [[16.0 14.0] 
#     [1.0 23.0]]

In [35]:
X = np.array([[13.0,7.0],[4.0,47.0]], dtype=np.float64)
Y = np.array([[16.0,14.0],[1.0,23.0]], dtype=np.float64)
print (X-Y)
print(np.subtract(X, Y))

[[-3. -7.]
 [ 3. 24.]]
[[-3. -7.]
 [ 3. 24.]]


In [36]:
#to perform elementwise multiplication
print(x * y)
print(np.multiply(x, y))

[[ 5. 12.]
 [21. 32.]]
[[ 5. 12.]
 [21. 32.]]


Note that this is different from performing matrix multiplication. The * operator is only used for elementwise multiplication.
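
For comparison, matrix multiplication in NumPy is written with the `@` operator or, equivalently, `np.matmul`. The short sketch below reuses the same values as `x` and `y` above to show how the two results differ.

```python
import numpy as np

x = np.array([[1, 2], [3, 4]], dtype=np.float64)
y = np.array([[5, 6], [7, 8]], dtype=np.float64)

elementwise = x * y      # multiplies matching entries
matrix_product = x @ y   # rows of x combined with columns of y

print(elementwise)       # [[ 5. 12.] [21. 32.]]
print(matrix_product)    # [[19. 22.] [43. 50.]]

# np.matmul gives the same result as the @ operator.
print(np.array_equal(matrix_product, np.matmul(x, y)))  # True
```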

In [37]:
#to perform elementwise division
print(x / y)
print(np.divide(x, y))

[[0.2        0.33333333]
 [0.42857143 0.5       ]]
[[0.2        0.33333333]
 [0.42857143 0.5       ]]


**Sorting NumPy Arrays**

In [38]:
a = np.array([1, 2, 3, 4, 5, 2, 1])
print(a)

[1 2 3 4 5 2 1]


In [39]:
np.sort(a) #returns a sorted copy in ascending order; a itself is unchanged
print (np.sort(a))

[1 1 2 2 3 4 5]


In [40]:
np.flip(np.sort(a)) #reversing the ascending sort gives descending order
print (np.flip(np.sort(a)))

[5 4 3 2 2 1 1]


In [41]:
np.flip(np.sort(a))[:2]  #sort in descending order, then slice out the two largest values
print (np.flip(np.sort(a))[:2])

[5 4]


## Manipulating NumPy Arrays

The function reshape() changes the shape of an array without altering its elements, provided the total number of elements stays the same. Its use is illustrated below.

In [42]:
x=np.arange(12) #creating a 1-D array of 12 elements, shape (12,)
print(x)

[ 0  1  2  3  4  5  6  7  8  9 10 11]


In [43]:
y=x.reshape(6,2) #changing the shape from (12,) to (6, 2)
print(y)

[[ 0  1]
 [ 2  3]
 [ 4  5]
 [ 6  7]
 [ 8  9]
 [10 11]]
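
Two further details of reshape() may be useful here, shown as a small sketch: passing -1 lets NumPy work out one dimension by itself, and asking for a shape that does not match the number of elements raises a ValueError. The variable names are chosen only for this illustration.

```python
import numpy as np

x = np.arange(12)

# -1 asks NumPy to infer that dimension from the total element count.
three_rows = x.reshape(3, -1)
print(three_rows.shape)   # (3, 4)

# The element count must be preserved: 12 values cannot fill a 5 x 3 grid.
reshape_failed = False
try:
    x.reshape(5, 3)
except ValueError as err:
    reshape_failed = True
    print("Cannot reshape:", err)
```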


Just like concatenation of lists or strings we can concatenate arrays as well. In the following example we concatenate three one-dimensional arrays to one array. 

The elements of the second array are appended to the first array. After this the elements of the third array are appended.

In [44]:
x = np.array([3,12])
y = np.array([12,4,7])
z = np.array([2,6,8])
print(x)
print(y)
print(z)
c = np.concatenate((x,y,z)) #function to perform concatenation
print(c)

[ 3 12]
[12  4  7]
[2 6 8]
[ 3 12 12  4  7  2  6  8]


If we are concatenating multidimensional arrays, we can choose the axis along which they are joined. The arrays must have the same shape in every dimension except the one being concatenated along. 

The default value is axis = 0, which joins the arrays row-wise. The example below uses axis = 1 instead:

In [45]:
x = np.array([[2,4],[7,9]])
print(x)
x=x.reshape(1,4)
y = np.array([10,12,14,16])
print(y)
y=y.reshape(1,4)
c = np.concatenate((x,y),axis=1)
print(c)

[[2 4]
 [7 9]]
[10 12 14 16]
[[ 2  4  7  9 10 12 14 16]]
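
The effect of the axis argument can be sketched with three small arrays. Their names and values are made up for the illustration; note that each pair only needs to agree in the dimension that is not being joined.

```python
import numpy as np

top = np.array([[1, 2], [3, 4]])   # shape (2, 2)
bottom = np.array([[5, 6]])        # shape (1, 2)
side = np.array([[7], [8]])        # shape (2, 1)

# axis=0 appends rows, so the column counts must match.
stacked_rows = np.concatenate((top, bottom), axis=0)
print(stacked_rows.shape)   # (3, 2)

# axis=1 appends columns, so the row counts must match.
stacked_cols = np.concatenate((top, side), axis=1)
print(stacked_cols.shape)   # (2, 3)
```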


**Pro-tip**: To have a list of useful functions and commands always at your fingertips, you can save this [cheatsheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf)!

## Assignment

In [None]:
# Create a 1-D NumPy array of 10 elements where all the elements are 0


In [None]:
# Check the dimension of above array 


In [None]:
# Convert the above 1-D array into a 2-D array with dimensions (5,2)


In [None]:
# Create a NumPy array of 10 elements where all the elements are 1


In [None]:
# Create a NumPy array of 10 random elements.



# Print the last 5 elements of the array

# Print elements from 3rd index to 7th index.

# Sort array in ascending order


# **Pandas**

Pandas is built on top of NumPy and is one of the most popular open-source packages available in Python. Its main utility lies in providing a large number of functions for handling real-world data.

You do not need to fully understand the capabilities of Pandas at present; the examples given here simply demonstrate some of its useful features. In later sessions we will go through in full detail how Pandas is used in an ML workflow.

Going forward, however, we suggest paying special attention to the syntax and to the flow and order of the code, as the same patterns are generally used again and again.

### Importing Pandas

In [46]:
#Do you recognize the keywords here? Both the import statement and the library alias follow a syntax that will be common going forward
import pandas as pd

**Data Frame**

A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.

DataFrames accept many different kinds of input. If you would like to learn more, feel free to search online or go to the [pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html) documentation. <br>For the purpose of this example, we shall demonstrate one useful way to work with DataFrames. <br><br>
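
As a rough illustration of the different kinds of input, the sketch below builds DataFrames from a dictionary of lists, from a list of dictionaries, and from a 2-D NumPy array. The column names and values are made up for the example.

```python
import numpy as np
import pandas as pd

# 1. Dictionary of lists: each key becomes a column name.
from_dict = pd.DataFrame({"city": ["Jaipur", "Delhi"], "temp_c": [31.5, 29.0]})

# 2. List of dictionaries: each dictionary becomes one row.
from_records = pd.DataFrame([{"city": "Jaipur", "temp_c": 31.5},
                             {"city": "Delhi", "temp_c": 29.0}])

# 3. 2-D NumPy array plus explicit column names.
from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"])

print(from_dict)
print(from_records)
print(from_array)
```

The first two constructions describe the same table, once column by column and once row by row.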

In [47]:
import pandas as pd
import numpy as np

### Pandas Data Structures:
- **Series**: A one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index.
- **DataFrame**: A two-dimensional, size-mutable, potentially heterogeneous tabular data structure.

In [48]:
pandas_series = pd.Series([14, 23, 16.1, -9]) #Creating a Series; displaying it shows the index next to the values
pandas_series

0    14.0
1    23.0
2    16.1
3    -9.0
dtype: float64

In [49]:
pandas_series = pd.Series([14, 23, 16.1, -9], index = ["w", "x", "y", "z"]) #Naming the index labels "w", "x", "y", "z"
pandas_series

w    14.0
x    23.0
y    16.1
z    -9.0
dtype: float64

In [50]:
pandas_series.iloc[1] #Printing the value at position 1; plain [1] on a labelled Series is deprecated in recent pandas versions

23.0
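
Because this Series has string labels, it helps to keep label-based and position-based access apart. The sketch below uses the `.loc` and `.iloc` accessors for that purpose; the variable names are chosen only for this illustration.

```python
import pandas as pd

s = pd.Series([14, 23, 16.1, -9], index=["w", "x", "y", "z"])

by_label = s.loc["x"]      # label-based lookup
by_position = s.iloc[1]    # position-based lookup (second element)
last_two = s.iloc[-2:]     # positional slice of the final two elements

print(by_label)      # 23.0
print(by_position)   # 23.0
print(last_two)
```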

In [51]:
#What will happen if we try to access a position beyond the end of the Series?


In [52]:
pandas_series[pandas_series < 10] #Boolean mask that keeps only the values in the Series below 10

z   -9.0
dtype: float64

In [53]:
pandas_series[pandas_series > 1]

w    14.0
x    23.0
y    16.1
dtype: float64
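
The two filters above work by building a boolean mask, that is, a Series of True/False values, and keeping only the entries where the mask is True. As a small sketch, masks can also be combined with `&` (and) and `|` (or); each condition needs its own parentheses. The variable names are illustrative.

```python
import pandas as pd

s = pd.Series([14, 23, 16.1, -9], index=["w", "x", "y", "z"])

mask = s > 10              # Series of booleans, one per element
print(mask)

between = s[(s > 10) & (s < 20)]   # both conditions must hold
outside = s[(s < 0) | (s > 20)]    # either condition may hold

print(between)
print(outside)
```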

In [54]:
#series from dictionary

dictionary = {"x":100,"y":104.2,"z":-23.4}
pd.Series(dictionary)

x    100.0
y    104.2
z    -23.4
dtype: float64

In [55]:
data = {"date": ["Jan,2014", "Jan, 2013", "Jan, 2012", "Jan,2011"],
        "global_temp": [0.87, 0.74, 0.65, 0.63]}
dataframe = pd.DataFrame(data)
dataframe

        date  global_temp
0   Jan,2014         0.87
1  Jan, 2013         0.74
2  Jan, 2012         0.65
3   Jan,2011         0.63


In [56]:
pd.DataFrame(dataframe,columns = ["global_temp","date"])

   global_temp       date
0         0.87   Jan,2014
1         0.74  Jan, 2013
2         0.65  Jan, 2012
3         0.63   Jan,2011


In [57]:
print (dataframe.date)
print ("-------")
print (dataframe['date'])

0     Jan,2014
1    Jan, 2013
2    Jan, 2012
3     Jan,2011
Name: date, dtype: object
-------
0     Jan,2014
1    Jan, 2013
2    Jan, 2012
3     Jan,2011
Name: date, dtype: object


In [58]:
dataframe['date'].values #Retrieving the values of the 'date' column as a NumPy array

array(['Jan,2014', 'Jan, 2013', 'Jan, 2012', 'Jan,2011'], dtype=object)

In [59]:
dataframe['date'].keys() #returns the 'info axis' for the pandas object. 
#If the pandas object is a Series, it returns the index. 
#If the pandas object is a DataFrame, it returns the columns.

RangeIndex(start=0, stop=4, step=1)

In [60]:
#Dataframe : viewing Data.

dataframe = pd.DataFrame(np.random.randn(20, 5), 
                         columns = ['A', 'B', 'C', 'D', 'E'])

In [61]:
print (dataframe.head()) #Return the first n rows (n defaults to 5)

          A         B         C         D         E
0  0.304966  0.954891  0.374953 -0.162668 -1.018769
1 -0.198257  1.022516 -0.290854 -0.135745  2.230877
2 -0.064832 -0.258836 -0.907624  0.225065  0.143518
3 -0.957679  0.257221 -0.637870  1.188904  1.739763
4 -0.588647  0.116084  0.425431  0.070502  1.335398


In [62]:
print (dataframe.tail(n = 2)) #Return the last n rows, in this case 2

           A         B         C         D         E
18 -0.935542  0.309766  1.083924 -1.952566 -0.313348
19  0.872242 -0.637429  0.362523  0.877372 -0.732539


In [63]:
print (dataframe.describe()) #Generate descriptive statistics.
#Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

               A          B          C          D          E
count  20.000000  20.000000  20.000000  20.000000  20.000000
mean   -0.100483   0.239777   0.011243  -0.062165   0.016519
std     0.827295   0.845928   0.891605   1.228239   0.971992
min    -1.207208  -1.309124  -1.612182  -2.009515  -1.303967
25%    -0.737333  -0.223854  -0.717235  -0.780171  -0.767276
50%    -0.174141   0.261604   0.018162  -0.149207  -0.077060
75%     0.438964   0.971797   0.532904   0.234798   0.584859
max     1.552435   1.517161   2.024938   3.122326   2.230877


In [64]:
print (dataframe.sort_values(by='C').head())#Sort by the values along either axis

           A         B         C         D         E
5  -0.671264 -0.730282 -1.612182 -1.037877 -1.085076
2  -0.064832 -0.258836 -0.907624  0.225065  0.143518
14 -0.938920 -1.203129 -0.824873  0.054502  0.626523
15  0.981772 -0.027189 -0.820087 -0.254305  0.764684
8   0.840959  1.496432 -0.769800 -0.767107  0.160801


In [65]:
print (dataframe.sort_index(ascending = False).head()) #Sort object by labels (along an axis)

           A         B         C         D         E
19  0.872242 -0.637429  0.362523  0.877372 -0.732539
18 -0.935542  0.309766  1.083924 -1.952566 -0.313348
17 -0.150025 -1.309124 -0.024766 -0.331116 -0.926042
16 -0.616171 -0.187114  0.881771  0.263996 -1.303967
15  0.981772 -0.027189 -0.820087 -0.254305  0.764684


In [66]:
print (dataframe['A'].head())

0    0.304966
1   -0.198257
2   -0.064832
3   -0.957679
4   -0.588647
Name: A, dtype: float64


In [67]:
print (dataframe[1:4])

          A         B         C         D         E
1 -0.198257  1.022516 -0.290854 -0.135745  2.230877
2 -0.064832 -0.258836 -0.907624  0.225065  0.143518
3 -0.957679  0.257221 -0.637870  1.188904  1.739763


## Assignment

Let's say we are interested in the weather. Specifically, we are interested in data from [Open Government Data](https://data.gov.in/keywords/weather) and have obtained a file called JaipurFinalCleanData.csv. <br><br>A comma-separated values (CSV) file, as the name implies, contains a set of data separated by commas. Files like these will be familiar to users of spreadsheet software such as Excel or Google Sheets.
<br><br>In this example, we will open a CSV file in Python and load it into a DataFrame.
<br><br>Try loading the file, and follow the rest of the exercise. Use your friend, the search engine, to find a way to load your file into Colab!

In [68]:
#If working locally (using Anaconda, for example), the CSV file should be in the same folder as this notebook.
#This example is expected to raise errors until the file has been uploaded



#First, load the file from its stored location into a variable that we will call dataframe

#Print the first five rows of the DataFrame to understand the data.
print(dataframe.head())

          A         B         C         D         E
0  0.304966  0.954891  0.374953 -0.162668 -1.018769
1 -0.198257  1.022516 -0.290854 -0.135745  2.230877
2 -0.064832 -0.258836 -0.907624  0.225065  0.143518
3 -0.957679  0.257221 -0.637870  1.188904  1.739763
4 -0.588647  0.116084  0.425431  0.070502  1.335398
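
As a hint for the exercise, reading CSV data into a DataFrame is usually done with `pd.read_csv`. The sketch below reads from an in-memory string so that it runs without any file; the column names and values are made up and are not taken from JaipurFinalCleanData.csv. Getting the real file into Colab is still left to you.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk; the columns and values are hypothetical.
csv_text = io.StringIO(
    "date,mean_temperature\n"
    "2016-05-01,34\n"
    "2016-05-02,36\n"
    "2016-05-03,35\n"
)

# pd.read_csv also accepts a file path string in place of the buffer.
weather = pd.read_csv(csv_text)

print(weather.head())
print(weather.dtypes)
```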


In [69]:
#Try to display more rows in the same way.


<div class="alert alert-block alert-info">
<b>Tip:</b> Another useful thing you could do is to check all the datatypes within the dataframe <br>
dataframe.dtypes
</div>

Have you noticed? Using print() on the DataFrame shows many values, but how are you supposed to make sense of all of them? That is where data visualization comes in: graphs, plots and charts help us better understand what we are looking at and what we might be looking for. In the following notebook we shall look into a useful and common library, **matplotlib**, which helps us with this.