Python is a powerful and flexible programming language. However, it does not ship with built-in tools for large-scale numerical computation or for visualizing data. We mainly use three libraries for data analysis:

**1) Numpy** <br>
**2) Matplotlib** <br>
**3) Pandas**

We will be covering Numpy and Matplotlib in this module.

# Basic Data Tools: Numpy

Numpy is built around multi-dimensional arrays, which is why it is used extensively in Machine Learning. Its highlights are:

**1) Mathematical toolkit (sin, log, exp)** <br>
**2) Random sampling module (often used in ML)** <br>
**3) Numpy nd-array object**

A numpy array can be one dimensional, two dimensional or n-dimensional. We will look at:
1. Making numpy arrays
2. How numpy arrays are faster than python lists
3. Numpy array operations
4. Broadcasting

### Making Numpy Arrays

In [1]:
#to access numpy, we will have to import it
import numpy as np

In [2]:
lists = [[1,2],[3,4]]
np_array = np.array(lists) # make a numpy array from a list
np_array

array([[1, 2],
       [3, 4]])

In [3]:
np_array.shape # the size of the array along each dimension: (rows, columns)

(2, 2)

In [4]:
np_array_modified = np_array.reshape(4,1) # reshape into 4 rows and 1 column; the total number of elements must stay the same
np_array_modified

array([[1],
       [2],
       [3],
       [4]])
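
As a small sketch of how shape and reshape relate, the snippet below rebuilds the same 2 x 2 array and inspects it. The attributes `ndim` and `size` and the `-1` placeholder are not used in the cells above; they are added here only for illustration.

```python
import numpy as np

np_array = np.array([[1, 2], [3, 4]])

# ndim is the number of axes; size is the total number of elements
print(np_array.ndim, np_array.size)   # 2 4

# -1 asks numpy to infer that dimension from the total size
column = np_array.reshape(-1, 1)
print(column.shape)                   # (4, 1)

# reshape back to a single dimension
flat = column.reshape(-1)
print(flat)                           # [1 2 3 4]
```

A reshape only succeeds when the new shape holds exactly the same number of elements as the old one, so `np_array.reshape(3, 1)` would raise a `ValueError` here.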

We can also create a variety of arrays using Numpy's array-creation functions:

In [None]:
np.linspace(1,5,3) # 3 evenly spaced numbers from 1 to 5, both endpoints included

In [None]:
np.zeros(5) # array of zeros

In [5]:
np.zeros((3,3,3)) # 3 dimensional array

array([[[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]],

       [[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]],

       [[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]]])

In [6]:
np.eye(5) # identity matrix

array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])

In [7]:
np.diag((2,5,7)) # diagonal matrix

array([[2, 0, 0],
       [0, 5, 0],
       [0, 0, 7]])
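
The creation functions above can be compared side by side in a short sketch. `np.arange`, `np.ones` and `np.full` do not appear in the cells above; they are included here as closely related helpers, not as part of the original walkthrough.

```python
import numpy as np

# linspace: 3 evenly spaced points from 1 to 5, both endpoints included
print(np.linspace(1, 5, 3))      # [1. 3. 5.]

# arange: step of 2 starting at 1, stop value 5 excluded
print(np.arange(1, 5, 2))        # [1 3]

# zeros / ones / full take a shape; full also takes the fill value
print(np.zeros(5))               # [0. 0. 0. 0. 0.]
print(np.ones((2, 3)).shape)     # (2, 3)
print(np.full((2, 2), 7))        # [[7 7]
                                 #  [7 7]]
```

Note the difference in how the two range functions treat the end value: `linspace` takes the number of points and includes the endpoint, while `arange` takes a step and excludes it.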

### Numpy Optimality Test

**Why are numpy arrays useful?** 

Numpy arrays are implemented largely in optimized, compiled C code, and Python acts as the interface to it, so computations on them run much faster than equivalent pure-Python loops.

In [8]:
list_1 = [np.random.random() for _ in range(10000)]
list_2 = np.array(list_1)

In [9]:
%%timeit -n 50000
# run the statement 50000 times per repeat and report the average time per loop
np.sum(list_2)

3.28 μs ± 172 ns per loop (mean ± std. dev. of 7 runs, 50,000 loops each)


In [None]:
# TODO: Now find the time needed for using the built-in sum function on list_1
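
Outside a notebook, where the `%%timeit` magic is not available, one possible way to set up this comparison is with the standard-library `timeit` module. This is only a sketch: the repeat count of 1000 is an arbitrary choice, and the timings depend entirely on the machine, so no numbers are quoted here.

```python
import timeit
import numpy as np

list_1 = [np.random.random() for _ in range(10000)]
list_2 = np.array(list_1)

# time 1000 calls of each approach (fewer repeats than the cell above)
t_builtin = timeit.timeit(lambda: sum(list_1), number=1000)
t_numpy = timeit.timeit(lambda: np.sum(list_2), number=1000)

print(f"built-in sum on list : {t_builtin / 1000 * 1e6:.2f} µs per call")
print(f"np.sum on array      : {t_numpy / 1000 * 1e6:.2f} µs per call")
```

On most machines `np.sum` on the array should come out well ahead, but it is worth running the comparison yourself rather than taking that for granted.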

### Numpy Operations

Numpy also offers a convenient way of performing operations on arrays. Let's calculate the column sums of the lists and np_array variables created above, first with plain Python and then with Numpy.

In [None]:
lists

In [None]:
row_length = len(lists[0])
[sum(row[i] for row in lists) for i in range(row_length)]

In [None]:
np.sum(np_array,axis = 0)
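
The `axis` argument decides which dimension is collapsed. A minimal sketch on the same 2 x 2 array (rebuilt here so the snippet runs on its own):

```python
import numpy as np

np_array = np.array([[1, 2], [3, 4]])

# axis=0 collapses the rows, giving one sum per column
print(np.sum(np_array, axis=0))   # [4 6]

# axis=1 collapses the columns, giving one sum per row
print(np.sum(np_array, axis=1))   # [3 7]

# no axis: sum over every element
print(np.sum(np_array))           # 10
```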

In [None]:
list_3 = np.array([1,2,3])
list_4 = np.array([1,4,9])

print(np.dot(list_3, list_4))  # dot product
print(list_3 @ list_4)         # QUESTION: what does this do?
print(list_3 + list_4)         # element-wise addition
print(list_3 * list_4)         # element-wise multiplication
print(np.cos(list_3))          # cosine of each entry

In [18]:
## TODO: Make two arrays and compute their dot product

In [None]:
np_matrix = np.random.randint(5,25,size = (5,6))
np_matrix

In [None]:
print(np_matrix[1,2:5])              # row 1, columns 2 to 4
print(np.mean(np_matrix,axis = 1))   # mean of each row
print(np.std(np_matrix, axis = 1))   # standard deviation of each row
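
Because `np_matrix` is random, its output changes on every run. The sketch below repeats the same kind of slicing and aggregation on a small fixed matrix (a made-up example, named `m` here) so the results can be checked by hand.

```python
import numpy as np

m = np.arange(12).reshape(3, 4)   # rows: [0..3], [4..7], [8..11]

# row 1, columns 2 and 3 (the stop index 4 is excluded)
print(m[1, 2:4])                  # [6 7]

# axis=1 aggregates along each row: one value per row
print(np.mean(m, axis=1))         # [1.5 5.5 9.5]

# axis=0 aggregates down each column: one value per column
print(np.max(m, axis=0))          # [ 8  9 10 11]
```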

In [None]:
# We can multiply matrices using @
# TODO: Multiply lists by an array given by [2,4]
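
To make the difference between `*` and `@` concrete for two-dimensional arrays, here is a minimal sketch. The matrices `A` and `B` are made up for illustration and are not the ones used in the exercise.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])

# * multiplies matching entries
print(A * B)    # [[0 2]
                #  [3 0]]

# @ is the matrix product (rows of A times columns of B)
print(A @ B)    # [[2 1]
                #  [4 3]]
```

Multiplying by this particular `B` on the right swaps the columns of `A`, which makes it easy to see that `@` is not an element-wise operation.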

These are some of the basic functionalities of Numpy. We will encounter more as the course goes on.

Data visualization is often the first step in data analysis. Matplotlib is an important plotting library, which we will look at after the section on broadcasting below.

### Broadcasting

Broadcasting is a powerful mechanism that allows numpy to work with arrays of different shapes when performing arithmetic operations. Frequently we have a smaller array and a larger array, and we want to use the smaller array multiple times to perform some operation on the larger array.

For example, suppose that we want to add a constant vector to each row of a matrix. We could do it like this:

In [None]:
# We will add the vector v to each row of the matrix x,
# storing the result in the matrix y
x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])  # [4 x 3] matrix
v = np.array([1, 0, 1])  # [3]-dim vector
y = np.zeros_like(x)   # Create a matrix of zeros with the same shape and dtype as x

# Add the vector v to each row of the matrix x with an explicit loop
for i in range(4):
    y[i, :] = x[i, :] + v

y

This works; however, when the matrix x is very large, running an explicit loop in Python can be slow. Note that adding the vector v to each row of the matrix x is equivalent to forming a matrix vv by stacking multiple copies of v vertically, then performing elementwise summation of x and vv. We could implement this approach like this:

In [None]:
vv = np.tile(v, (4, 1))  # Stack 4 copies of v on top of each other
print(vv)
y = x + vv  # Add x and vv elementwise
y

Numpy broadcasting allows us to perform this computation without actually creating multiple copies of v. Consider this version, using broadcasting:

In [None]:
print(f"x.shape = {x.shape}, v.shape={v.shape}")
y = x + v  # Add v to each row of x using broadcasting
y

In [None]:
# TODO: Make x with shape 10000 x 1000 and v with shape 1000 and time all three ways to get y

Recommended: read the explanation from the [documentation](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html) or this [explanation](http://wiki.scipy.org/EricsBroadcastingDoc).

Functions that support broadcasting are known as **universal functions**. You can find the list of all universal functions in the [documentation](http://docs.scipy.org/doc/numpy/reference/ufuncs.html#available-ufuncs).
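
The shape rule behind broadcasting can be seen in a short sketch: shapes are compared from the right, and a dimension of size 1 (or a missing dimension) is stretched to match the other array. The arrays `col` and `row` below are made up for illustration, and `np.broadcast_to` is used only to show that the stretching does not copy data.

```python
import numpy as np

col = np.array([[0], [10], [20]])   # shape (3, 1)
row = np.array([1, 2, 3])           # shape (3,)

# (3, 1) combined with (3,) broadcasts to (3, 3)
table = col + row
print(table.shape)                  # (3, 3)
print(table)
# [[ 1  2  3]
#  [11 12 13]
#  [21 22 23]]

# broadcast_to returns a stretched view without copying the data
stretched = np.broadcast_to(row, (4, 3))
print(stretched.strides[0])         # 0: every row reuses the same memory

# incompatible trailing dimensions raise an error
try:
    np.ones((4, 3)) + np.ones(4)
except ValueError as err:
    print("cannot broadcast:", err)
```

The last case fails because the trailing dimensions are 3 and 4, and neither of them is 1.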

# Basic Data Tools: Matplotlib

In [24]:
import matplotlib.pyplot as plt

In [25]:
num_of_days = 365

days_array = np.arange(num_of_days)
profit_array = np.random.randn(num_of_days)* 10000  


In [None]:
plt.plot(days_array,profit_array)       # line plot of profit against day
plt.xlabel('Time (in days)')
plt.ylabel('Profit in Millions')

In [27]:

np.random.seed(2)
actuals_turnover = np.random.randint(100000,200000,(1,20))
forecast_turnover = 130000 + np.random.randint(0,100000,(1,20))*0.3

In [None]:
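plt.scatter(actuals_turnover,forecast_turnover, label = 'Data')
plt.plot(np.arange(100000,200000),np.arange(100000,200000), color = 'r', label = 'Forecast = Actual') # reference line y = x
plt.xlabel('Actuals(in million)')
plt.ylabel('Forecasts(in million)')
plt.legend() # without this call the labels above are never displayed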
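
The cells above use the `plt.` functions directly. Matplotlib also offers an object-oriented style in which the figure and axes are created explicitly. The sketch below shows that style on made-up data; the 30-day range, the seed and all labels are assumptions chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)           # seeded generator, for repeatable data
days = np.arange(30)
profit = rng.normal(0, 10000, size=30)   # made-up daily profit values

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(days, profit, label="daily profit")
ax.axhline(0, color="gray", linewidth=0.8)   # zero line for reference
ax.set_xlabel("Time (in days)")
ax.set_ylabel("Profit")
ax.set_title("Simulated daily profit")
ax.legend()
fig.tight_layout()
```

In a notebook the figure is displayed automatically at the end of the cell. In a plain script you would call `plt.show()` or `fig.savefig(...)` afterwards.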

In [None]:
# TODO: Plot the graph of a parabolic function in one variable using matplotlib. How would you do it?
# Hint (Well, more like the answer): Use np.linspace to create a dense discrete range of values for x and then plot y = x^2