<h3> Introduction to Numpy </h3>
In this section we will learn about NumPy, a package for working with arrays.
<h3> References </h3>
Python for Data Analysis, 2nd Edition <br>
<a href="https://docs.scipy.org/doc/numpy-dev/user/quickstart.html"> Numpy Tutorial </a><br>


#### Histograms

In [None]:
import numpy as np
import matplotlib.pyplot as plt

### What is a histogram?

In [None]:
sa = np.array( [1, 2, 3, 4, 5] )
data = [ np.random.choice(sa) for _ in range(1000) ]
#print (data)

In [None]:
# hist[i] is the number of samples in bin i
# bin i covers binedges[i] <= value < binedges[i+1]; only the last bin also includes its right edge
binedges = np.array( [1, 2, 3, 4, 5, 6] )
hist, _ = np.histogram(data, bins=binedges)
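
The edge convention described in the comments above can be checked on a tiny array. The values below are made up for illustration and are chosen to fall exactly on the bin edges; this is a minimal sketch, not part of the original data.

```python
import numpy as np

# Toy values (assumed for illustration) that sit exactly on the bin edges.
values = np.array([1, 2, 2, 3, 4, 4])
edges = np.array([1, 2, 3, 4])  # three bins: [1, 2), [2, 3), [3, 4]

counts, _ = np.histogram(values, bins=edges)

# The value 2 goes to the second bin (left edge is inclusive),
# and both 4s go to the last bin because it is closed on the right.
print(counts)
```

This is why `binedges` above runs up to 6: each of the values 1 to 5 then gets a bin of its own.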

In [None]:
plt.hist(data, bins=binedges)
plt.show()

In [None]:
data.count(1)

In [None]:
hist, _ = np.histogram(data, bins=binedges, density=True)

In [None]:
hist

density : bool, optional

If False, the result will contain the number of samples in each bin. If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1. Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function.

In [None]:
w = np.diff(binedges)

In [None]:
isum = w * hist
np.sum(isum)
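
The cells above use bins of width 1, so the density values themselves happen to sum to 1. With unequal bin widths this is no longer the case, and only the area (height times width) sums to 1. A minimal sketch with a small made-up sample:

```python
import numpy as np

# Small hand-made sample and unequal bins (both assumed for illustration).
sample = np.array([1, 1, 2, 3, 3, 3, 4, 5, 5, 5])
edges = np.array([1, 3, 6])  # bin widths 2 and 3

density, _ = np.histogram(sample, bins=edges, density=True)
widths = np.diff(edges)

print(density)                   # bar heights, i.e. pdf values per bin
print(density.sum())             # heights alone do not sum to 1
print(np.sum(density * widths))  # area under the histogram
```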

Exercise 1: Create 1000 random integers in the range 0 to 9, uniformly distributed, using the NumPy function np.random.randint(). Compute a histogram using the pre-specified binedges = [0, 2, 4, 6, 8, 10]. Plot the histogram.
With the density=True option of the NumPy function histogram(), show that the histogram values in the bins are normalized so that the integral over the range is 1.

In [None]:
rng = np.random.RandomState(10)  # deterministic random data
mu, sigma = 2, 0.5
v = rng.normal(mu,sigma,1000)
print (np.mean(v))
print (np.std(v))
hist, bin_edges = np.histogram(v, bins=10, density=False)
print (hist)
print (bin_edges)

In [None]:
plt.hist(v, bins=bin_edges)
plt.show()

In [None]:
# hist[i] is the number of samples in bin i
# bin 0 counts the values of v with bin_edges[0] <= v < bin_edges[1]
print (hist[0], bin_edges[0], bin_edges[1]) 
print (hist.sum())

In [None]:
histT, bin_edges = np.histogram(v, bins=10, density=True) 
# here histT[i] is the pdf value in bin i, normalized such that the integral over the range is 1
print (histT)
print (histT.sum())
print (np.diff(bin_edges))
np.sum(histT * np.diff(bin_edges))

In [None]:
import scipy.stats as stats   # for the normal pdf (mlab.normpdf was removed from matplotlib)
plt.hist(v, density=True, bins=bin_edges)   # density=True replaces the removed normed=True
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = stats.norm.pdf(x, mu, sigma)   # pdf with the true parameters
plt.plot(x, p, 'r', linewidth=2)
plt.show()

In [None]:
plt.hist(v, density=True, bins=bin_edges)
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = stats.norm.pdf(x, np.mean(v), np.std(v))   # pdf with parameters estimated from the sample
plt.plot(x, p, 'r', linewidth=2)
plt.show()
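
The red curve in the plots above is the normal density, which can also be evaluated directly with NumPy. The helper `normal_pdf` below is a hypothetical name introduced for this sketch; it is compared against `scipy.stats.norm.pdf` as a sanity check.

```python
import numpy as np
from scipy.stats import norm as normal_dist  # alias avoids clashing with other names called `norm`

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma, evaluated at x."""
    z = (x - mu) / sigma
    return np.exp(-0.5 * z**2) / (sigma * np.sqrt(2.0 * np.pi))

x = np.linspace(0.0, 4.0, 9)
p_manual = normal_pdf(x, mu=2.0, sigma=0.5)
p_scipy = normal_dist.pdf(x, loc=2.0, scale=0.5)

print(np.allclose(p_manual, p_scipy))
```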

In [None]:
histTP = histT * np.diff(bin_edges)
print (histTP.sum()) #total area for the histogram plot
#print (np.diff(bin_edges))
#print (histTP)
#print (hist/1000)
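
The relation hinted at by the commented-out lines above is that each density value equals the fraction of samples in the bin divided by the bin width. A minimal sketch, assuming a fresh sample drawn with `np.random.default_rng` (not the `RandomState` data used above):

```python
import numpy as np

rng = np.random.default_rng(0)            # seed chosen arbitrarily for reproducibility
sample = rng.normal(2.0, 0.5, size=1000)

counts, edges = np.histogram(sample, bins=10)
density, _ = np.histogram(sample, bins=edges, density=True)

widths = np.diff(edges)
fractions = counts / counts.sum()         # share of samples in each bin

print(np.allclose(density, fractions / widths))
print(np.allclose(density * widths, fractions))
```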

Exercise 2: This exercise is based on real data on sales of individual residential properties in Ames, Iowa. To complete it, read the property sale data from the CSV file <a href = "resources/train_house.csv"> property sale data</a> into a 2D NumPy array. First download the CSV file and save it in a local folder. You can then use the following command to read the CSV file into a 2D array:

house_data = np.genfromtxt('resources/train_house.csv', skip_header=1,delimiter=',')

Also, open the file in Excel or Notepad to see the header information. For this exercise there are three columns of interest: 1stFlrSF (first floor sq ft), 2ndFlrSF (second floor sq ft), and SalePrice. Extract these columns from the 2D NumPy array house_data using NumPy indexing. Compute the price per sq ft (sale price / total sq ft, where the total is the sum of the two floor areas) for each house. Compute the histogram and plot the price per sq ft data using bins=10, 20, 30, 40, and 50. Based on the computed mean and standard deviation of the data, plot the normal pdf to show the fit along with each plot (as shown above).
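
A note on how `np.genfromtxt` treats a file with mixed columns: with the default float dtype, fields that cannot be parsed as numbers come back as `nan`, while numeric columns can be picked out by position. The sketch below uses a tiny in-memory CSV whose column names and values are made up for illustration; it is not the Ames file.

```python
import io
import numpy as np

# Hypothetical three-row file; names and values are invented for this sketch.
toy_csv = io.StringIO(
    "Id,Zone,Area,Price\n"
    "1,RL,856,208500\n"
    "2,RM,1262,181500\n"
    "3,RL,920,223500\n"
)

toy = np.genfromtxt(toy_csv, skip_header=1, delimiter=',')
print(toy.shape)
print(np.isnan(toy[:, 1]).all())  # the text column "Zone" could not be parsed as float

area = toy[:, 2]    # all rows, column at position 2
price = toy[:, 3]   # all rows, column at position 3
print(area, price)
```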


### Health Data
http://www.healthdata.org/results/data-visualizations 

https://vizhub.healthdata.org/subnational/usa

http://ghdx.healthdata.org/record/united-states-diabetes-prevalence-county-1999-2012

Download the zip file and open the Excel file. Go to the control sheet and delete the first and third rows. Then save the sheet as a CSV file named 'US_County_Diabetes_Control_Rate_1999_2012_full.csv'.

In [None]:
ctrl_rate_data = np.genfromtxt('resources/US_County_Diabetes_Control_Rate_1999_2012_full.csv', skip_header=1, delimiter=',')

Exercise 3: For this exercise there are two columns of interest: "Prevalence, 1999, Both Sexes" and "Prevalence, 2012, Both Sexes". Extract these columns from the 2D NumPy array ctrl_rate_data using NumPy indexing. Plot histograms for the 1999 and 2012 data using 40 bins. Based on the computed mean and standard deviation of the data, plot the normal pdf to show the fit along with each histogram plot. Comment on the differences between the two histograms and the corresponding fits with the normal distribution.

### Working with Sparse Matrices
http://www.scipy-lectures.org/advanced/scipy_sparse/storage_schemes.html

A matrix in which most entries are zero is referred to as sparse. Several storage schemes take advantage of the sparseness and store only the nonzero entries.

Example sparse matrix: Netflix Rank Matrix of size 500K Users X  20K Movies (A row in this matrix is sparsely populated)

For computational efficiency, large sparse matrices should be stored and processed in a sparse storage format rather than as dense arrays.
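
To give a rough sense of the savings, the sketch below compares the memory used by a dense array with that of the three arrays behind a CSR matrix. The size (1000 x 1000) and the 1% fill are assumptions chosen for illustration.

```python
import numpy as np
import scipy.sparse as sps

# Random 1000 x 1000 matrix with about 1% nonzero entries (illustrative sizes).
S = sps.random(1000, 1000, density=0.01, format='csr', random_state=0)
D = S.toarray()

dense_bytes = D.nbytes
sparse_bytes = S.data.nbytes + S.indices.nbytes + S.indptr.nbytes

print(dense_bytes, sparse_bytes)
```

The dense array stores every zero explicitly, while the CSR arrays grow with the number of nonzero entries.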

### Example Sparse Matrix Storage Formats

####  Coordinate Format (COO)
-  also known as the ‘ijv’ or ‘triplet’ format
-  three NumPy arrays: row, col, data
-  data[i] is value at (row[i], col[i]) position


In [None]:
import numpy as np
import scipy.sparse as sps

In [None]:
#create using (data, ij) tuple:
row = np.array([0, 3, 1, 0])
col = np.array([0, 3, 1, 2])
data = np.array([4, 5, 7, 9])
mtx = sps.coo_matrix((data, (row, col)), shape=(4, 4))
mtx     
mtx.todense()


In [None]:
# duplicate entries are summed together:
row = np.array([0, 0, 1, 3, 1, 0, 0])
col = np.array([0, 2, 1, 3, 1, 0, 0])
data = np.array([1, 1, 1, 1, 1, 1, 1])
mtx = sps.coo_matrix((data, (row, col)), shape=(4, 4))
mtx.todense()
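
The COO object itself keeps every triplet as given. The summation only happens on conversion, for example to a dense array or to CSR. A minimal sketch reusing the same triplets (the names `coo` and `csr` are introduced here):

```python
import numpy as np
import scipy.sparse as sps

row = np.array([0, 0, 1, 3, 1, 0, 0])
col = np.array([0, 2, 1, 3, 1, 0, 0])
data = np.ones(7, dtype=int)
coo = sps.coo_matrix((data, (row, col)), shape=(4, 4))

print(coo.nnz)               # number of stored triplets, duplicates counted separately
csr = coo.tocsr()            # conversion merges duplicate (row, col) pairs by summing
print(csr.nnz)               # number of distinct positions after merging
print(csr[0, 0], csr[1, 1])  # summed values at the duplicated positions
```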


#### Compressed Sparse Row Format (CSR)
- three NumPy arrays: indices, indptr, data
- indices is array of column indices
- data is array of corresponding nonzero values
- indptr points to row starts in indices and data


In [None]:
data = np.array([1, 2, 3, 4, 5, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
indptr = np.array([0, 2, 3, 6])
mtx = sps.csr_matrix((data, indices, indptr), shape=(3, 3))
mtx.todense()
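
To see how `indptr` delimits the rows, the same three arrays can be decoded by hand with plain NumPy: row i owns the slice `indptr[i]:indptr[i+1]` of both `indices` and `data`. This is a sketch of the layout only, not of how SciPy implements the conversion.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
indptr = np.array([0, 2, 3, 6])

n_rows, n_cols = len(indptr) - 1, 3
dense = np.zeros((n_rows, n_cols), dtype=data.dtype)
for i in range(n_rows):
    start, stop = indptr[i], indptr[i + 1]   # row i occupies data[start:stop]
    dense[i, indices[start:stop]] = data[start:stop]

print(dense)
```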

In [None]:
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import spsolve
from numpy.linalg import solve, norm
from numpy.random import rand
import matplotlib.pyplot as plt

In [None]:
A = lil_matrix((1000, 1000))
A[0, :100] = rand(100)
A[1, 100:200] = A[0, :100]
A.setdiag(rand(1000))


In [None]:
# Now convert it to CSR format
A = A.tocsr()    
plt.spy(A)
plt.show()

In [None]:
# Create RHS and solve A x = b for x:
b = rand(1000)
x = spsolve(A, b)

In [None]:
# Convert it to a dense matrix and solve, and check that the result is the same:
x_ = solve(A.toarray(), b)

In [None]:
err = norm(x-x_)
err < 1e-10


Exercise 4: Create a sparse movie rank matrix for 200 users and 200 movies. For each user, assign a random score from 1 to 5 to each of 3 to 5 randomly selected movies. Use spy() to visualize the resulting matrix.
HINT: Create the three NumPy arrays 'ijv' for the COO format using NumPy functions such as np.random.choice(), np.random.randint(), and np.full(). You may want to set up a loop over the users to populate the row and column index arrays.