### **Part-1 : Hands on - Array Manipulation, Data Handling and Visualisation** 

Prerequisites : 

1. A little programming experience with Python
2. Motivation and interest

**What you will learn in this session** : 

1. Basic introduction to NumPy.
2. Importing data and manipulation with Pandas.
3. Visualisation with matplotlib and seaborn.

This will be a self-paced session. 

### Introduction to NumPy ~ 25 mins

NumPy is the fundamental package for scientific computing with Python. It contains, among other things:

1. A powerful N-dimensional array object
2. Sophisticated (broadcasting) functions
3. Tools for integrating C/C++ and Fortran code
4. Useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
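To make the broadcasting item in the overview above concrete, here is a minimal sketch. The array names `col`, `row` and `grid` are illustrative only: a column of shape (3, 1) and a row of shape (4,) are combined without writing any loop.

```python
import numpy as np

# A (3, 1) column and a (4,) row are stretched to a common (3, 4) shape.
col = np.array([[0], [10], [20]])   # shape (3, 1)
row = np.arange(4)                  # shape (4,)
grid = col + row                    # every column value is added to every row value

print(grid.shape)
print(grid)
```

Neither input is copied into a full (3, 4) array first; NumPy works out the matching shape from the two inputs.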




**Most important design facts of NumPy**

1. NumPy internally stores data in a contiguous block of memory, independent of other built-in Python objects
2. NumPy operations perform complex computations on entire arrays without the need for Python for loops
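The two design points above can be checked on a small array. This is a sketch with illustrative names (`arr`, `doubled`, `looped`), not a benchmark: it inspects the memory layout of an array and compares a whole-array operation with the equivalent nested Python loop.

```python
import numpy as np

arr = np.arange(6, dtype=np.int64).reshape(2, 3)

# One typed buffer: every element occupies the same number of bytes.
print(arr.flags['C_CONTIGUOUS'])   # row-major contiguous layout
print(arr.itemsize, arr.nbytes)    # bytes per element, bytes in total

# Whole-array arithmetic, no explicit Python loop.
doubled = arr * 2

# The same result written with nested Python loops over plain lists.
looped = np.array([[value * 2 for value in row] for row in arr.tolist()])
print(np.array_equal(doubled, looped))
```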

In [None]:
import numpy as np
np.__version__

##### Exploring the initial documentation.

In [None]:
#np?
#dir(np)

##### Understanding main design principle of NumPy

Efficient data processing and manipulation is essential to any data science activity. Datasets come from a wide variety of sources and formats, but despite this heterogeneity most data can fundamentally be represented as arrays of numbers, and NumPy offers a good balance of speed and flexibility for working with them.

In [None]:
n_elem = 500
my_arr = np.arange(n_elem)
my_list = list(range(n_elem))

## Using NumPy array operations
%time for _ in range(n_elem): my_arr2 = my_arr * 2

## Using a Python built-in list
%time for _ in range(n_elem): my_list2 = [x * 2 for x in my_list]

##### Creating N dimensional array, Data Type and Array Manipulation

In [None]:
data = np.random.randn(4,4)
print(data)

In [None]:
type(data)

In [None]:
data.dtype

In [None]:
#dir(np)

##### Creating simple array in numpy

In [None]:
# Initialising the data array in Numpy with the list 
data_tmp = [1,2,3,4,5,6]
data_array = np.array(data_tmp)
print(data_array)

In [None]:
data_array = np.array([1,2,3,4,6])
print(data_array)

In [None]:
# Finding the shape of the array
data_array.shape

In [None]:
# Using evenly space values to create array
data_array = np.arange(15)
print(data_array)

In [None]:
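# Using a different data type: store the values as byte strings
# (np.bytes_ is the current name; the older alias np.string_ was removed in NumPy 2.0)
data_array = np.array([1,2,4],dtype=np.bytes_)
print(data_array)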

In [None]:
data_array = np.array([[1,2,3],[4,5,6]])
print(data_array)
data_array.shape

#--------------------------------#

##### Several other ways of creating the arrays

In [None]:
# Create a length-10 integer array filled with zeros
data_array = np.zeros(10, dtype=int)
print(data_array)

In [None]:
# Create a 3x5 floating-point array filled with ones
np.ones((3, 5), dtype=float)

In [None]:
# Create a 3x5 array filled with 3.14
np.full((3, 5), 3.14)

In [None]:
# Create an array filled with a linear sequence
# Starting at 0, ending at 20, stepping by 2
# (this is similar to the built-in range() function)
np.arange(0, 20, 2)

In [None]:
# Create an array of five values evenly spaced between 0 and 1
np.linspace(0, 1, 5)

In [None]:
# Create a 3x3 array of uniformly distributed
# random values between 0 and 1
np.random.random((3, 3))

In [None]:
# Create a 3x3 array of normally distributed random values
# with mean 0 and standard deviation 1
np.random.normal(0, 1, (3, 3))

In [None]:
# Create a 3x3 array of random integers in the interval [0, 10)
np.random.randint(0, 10, (3, 3))

In [None]:
# Create a 3x3 identity matrix
np.eye(3)

In [None]:
# Create an uninitialized array of three values (float64 by default)
# The values will be whatever happens to already exist at that memory location
np.empty(3)

In [None]:
np.ones(10, dtype='int16')

##### Basic indexing and slicing

Indexing in NumPy works much as it does in C and C++: counting starts at zero, and elements are accessed with square brackets.

In [None]:
data_array = np.arange(15)
print(data_array)

To access elements from the start

In [None]:
data_array[0]

To access elements from the end

In [None]:
data_array[-1]

Accessing the multi-dimensional array

In [None]:
data_array = np.array([[1,2,3],[4,5,6]],dtype='int')

data_array[0,0]

Modifying the values

In [None]:
print(data_array)
data_array.shape

In [None]:
# data_array has an integer dtype, so the float 1.1 is silently truncated to 1
data_array[0,0] = 1.1

In [None]:
print(data_array)

##### Array Slicing

In [None]:
data_array = np.arange(10)
data_array

In [None]:
data_array[:5]  # first five elements

In [None]:
data_array[5:]  # elements from index 5 onward

In [None]:
data_array[4:7]  # middle sub-array

In [None]:
data_array[::2]  # every other element

In [None]:
data_array[1::2]  # every other element, starting at index 1

In [None]:
data_array[::-1]  # all elements, reversed

In [None]:
data_array[5::-2]  # reversed every other from index 5

##### Array slicing in the two-dimensional case

In [None]:
data_array = np.random.randn(10,10)
data_array

In [None]:
data_array[:,::3]

##### Fast Element-Wise Array Functions

A universal function, or ufunc, is a function that performs element-wise operations on data in ndarrays. Consider them as fast vectorized wrappers for simple functions that take one or more scalar values and produce one or more scalar results.
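As a small sketch of the description above, the following uses two binary ufuncs (`np.add`, `np.maximum`) and one ufunc that returns two results per element (`np.modf`). The input values are made up for illustration.

```python
import numpy as np

a = np.array([1.5, -2.25, 3.0])
b = np.array([0.5, 0.25, 1.0])

# Binary ufuncs: two inputs, one output per element.
total = np.add(a, b)          # same as a + b
largest = np.maximum(a, b)    # element-wise maximum of the two arrays

# np.modf returns two arrays: the fractional parts and the integer parts.
frac, whole = np.modf(a)

print(total, largest)
print(frac, whole)
```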

In [None]:
data_array = np.full((5, 5), 4)
data_array

In [None]:
np.sqrt(data_array)

In [None]:
np.exp(data_array)

In [None]:
np.isnan(data_array)

In [None]:
np.power(data_array,3)

In [None]:
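# reshape returns a new array; the original keeps its (5, 5) shape unless it is reassigned
data_array = data_array.reshape(25)
print(data_array)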

##### More information can be found here at <>

##### Array-oriented programming

In [None]:
import matplotlib.pyplot as plt

points = np.arange(-10,10,0.01)
x,y = np.meshgrid(points,points)
z =  np.sin(x) ** 10 + np.sin(10 * x) * np.tan(x)
plt.imshow(z); 
plt.colorbar()

##### Mathematical and Statistical Methods

In [None]:
points = np.arange(-10,10,0.01)

In [None]:
points.mean()

In [None]:
np.mean(points)

In [None]:
points.var()

In [None]:
points.std()

##### Unique logic for NumPy array

In [None]:
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])

In [None]:
np.unique(names)

##### Linear Algebra

###### Matrix multiplication

In [None]:
x = np.array([[1., 2., 3.], [4., 5., 6.]])
y = np.array([[6., 23.], [-1, 7], [8, 9]])

In [None]:
x.dot(y)      # method form
np.dot(x,y)   # function form; both are equivalent to x @ y

###### Solving a linear system

In [None]:
import numpy.linalg as nplg

In [None]:
#dir(nplg)

In [None]:
# Solve the linear system X @ a = Y for a
X = np.random.randn(5, 5)
Y = np.ones(5)
nplg.solve(X,Y)
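One way to check the result of `solve` is to multiply the solution back into the system. The sketch below assumes a seeded random matrix so that the run is reproducible; the names `A`, `b` and `x` and the seed are illustrative choices, not part of the session material.

```python
import numpy as np

rng = np.random.default_rng(0)      # seed chosen only for reproducibility
A = rng.standard_normal((5, 5))     # coefficient matrix
b = np.ones(5)                      # right-hand side

x = np.linalg.solve(A, b)           # solves A @ x = b

# Multiplying back should reproduce b up to floating-point error.
print(np.allclose(A @ x, b))
```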

### Data Handling with Pandas ~ 30 mins

Pandas is a central Python package for data analysis and is widely used for data cleaning and preparation ahead of visualisation and machine learning. The biggest difference between NumPy and Pandas is that a NumPy array holds homogeneous data of a single type, whereas Pandas is designed for heterogeneous, tabular data.
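The difference between homogeneous and heterogeneous data can be seen in a small sketch. The values and column names below are made up for illustration: a NumPy array coerces mixed values to one common type, while a DataFrame keeps one type per column.

```python
import numpy as np
import pandas as pd

# A NumPy array has a single dtype, so mixed values are coerced to a common type.
mixed = np.array([1, 2.5, 'three'])
print(mixed.dtype)      # a Unicode string dtype: every value became text

# A DataFrame keeps a separate dtype for every column.
frame = pd.DataFrame({'count': [1, 2, 3],
                      'ratio': [0.5, 1.5, 2.5],
                      'label': ['a', 'b', 'c']})
print(frame.dtypes)
```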

Please read more about Pandas here : 

<https://pandas.pydata.org>

In [None]:
import pandas as pd
import numpy as np

##### Introduction to Pandas data structure

1. **Series** 
2. **DataFrames**

Let's start with Pandas **Series** object

Constructing the series object 

**pd.Series(data, index=index)**

Because a Series stores its values with a single known data type, it can process and store data much more efficiently than a Python dictionary.
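A short sketch of what the typed storage gives you in practice, using a made-up `prices` dictionary: the Series reports one dtype for all its values and supports whole-Series arithmetic, whereas the dictionary needs an explicit loop.

```python
import pandas as pd

prices = {'apple': 1.2, 'pear': 0.8, 'plum': 2.5}   # hypothetical data
series_prices = pd.Series(prices)

print(series_prices.dtype)       # one dtype shared by all values

# Vectorised arithmetic on the Series, no explicit loop.
doubled = series_prices * 2

# The dictionary version needs a comprehension over every item.
doubled_dict = {key: value * 2 for key, value in prices.items()}
print(doubled.to_dict() == doubled_dict)
```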


In [None]:
series = pd.Series([1,2,3,4,7])
series.values

In [None]:
series.index

In [None]:
series[2]

##### Series with specialised indexing

In [None]:
series = pd.Series([1, 2, 3, 4],
                 index=['a', 'b', 'c', 'd'])
series

In [None]:
series['b']

In [None]:
series = pd.Series([1, 2, 3, 4],
                 index=[1, 2, 3, 4])
series

In [None]:
series[1]

**Python dictionary and Pandas Series object**

In [None]:
python_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
series_population = pd.Series(python_dict)
series_population

In [None]:
series_population['California']

Slicing operation

In [None]:
series_population['California':'New York']

In [None]:
series_population

In [None]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
series_dict = pd.Series(area_dict)

**Pandas Dataframe**

Generic Way to create dataframe 

dataframe = pd.DataFrame(data,index=[],columns=[])


In [None]:
dataframe = pd.DataFrame(np.random.randn(10, 2),columns=['colA','colB'], index=['rowA','rowB','rowC','rowD','rowE', 'rowF', 'rowG', 'rowH', 'rowI', 'rowJ'])
dataframe

Accessing the first five entries 

In [None]:
dataframe.head()

Accessing the column 

In [None]:
dataframe['colA']

Accessing a row

In [None]:
dataframe.loc['rowA']

In [None]:
dataframe.iloc[0]

In [None]:
# .ix was removed in pandas 1.0; use .iloc for positions and .loc for labels
print(dataframe.iloc[0])
print(dataframe.loc['rowA'])

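Changing all the values of a row at once

In [None]:
# The index holds string labels, so dataframe.loc[1] = 1 would append a new row labelled 1.
# Use an existing label instead (dataframe.iloc[1] = 1 is the positional equivalent).
dataframe.loc['rowB'] = 1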

Adding a column (or overwriting an existing one) with a single value

In [None]:
dataframe['colC'] = 1

In [None]:
dataframe.values

Creating the dataframe with python dictionary 

In [None]:
states_dataframe = pd.DataFrame({'population': series_population,'area': series_dict})
states_dataframe

In [None]:
states_dataframe['population']

In [None]:
states_dataframe['population']['California']

In [None]:
states_dataframe['population']['California':'New York' ]

##### Dropping the columns and rows

In [None]:
dataframe = pd.DataFrame(np.arange(16).reshape((4, 4)),
                            index=['uva', 'tud', 'tue', 'tut'],
                            columns=['A', 'B', 'C', 'D'])

In [None]:
dataframe

Dropping the row 

In [None]:
dataframe.drop(['uva']) # By default it will drop the rows.

Dropping the column

In [None]:
dataframe.drop(['A'],axis=1) # Putting axis = 1 will drop the column

You can also change the original data structure by using `inplace=True`. Note that this overwrites your original data structure.

In [None]:
dataframe.drop(['A'],axis=1,inplace=True)

In [None]:
dataframe

##### Accessing the values in the dataframe with `loc` : label based and `iloc` : integer based

In [None]:
dataframe = pd.DataFrame(np.arange(16).reshape((4, 4)),
                            index=['uva', 'tud', 'tue', 'tut'],
                            columns=['A', 'B', 'C', 'D'])
dataframe

In [None]:
dataframe.loc['uva',['A','B']]

In [None]:
dataframe.iloc[2,[1,2,3]]

In [None]:
dataframe.loc['uva','A']

There are several other ways to access the elements of a DataFrame, but you will not come across them as frequently. The basic options shown above for accessing values in a `Series` or `DataFrame` are more than sufficient for this course and for most data science work. There is also a plethora of literature on `Pandas` available in books and online.
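For reference, here is a brief sketch of a few of those other access patterns on the same 4x4 frame used above: a boolean mask, scalar access with `.at` and `.iat`, and label slicing with `.loc`. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape((4, 4)),
                  index=['uva', 'tud', 'tue', 'tut'],
                  columns=['A', 'B', 'C', 'D'])

big_rows = df[df['A'] > 4]               # boolean mask: rows where column A exceeds 4
single = df.at['tue', 'B']               # scalar access by row and column label
single_pos = df.iat[2, 1]                # the same cell by integer position
subset = df.loc['tud':'tue', 'B':'C']    # label slices include both end points

print(big_rows)
print(single, single_pos)
print(subset)
```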

##### Transforming Data using a function or mapping

Data transformation is one of the most important steps. Much of data science is about transforming, manipulating and interpreting data. We can transform the column or row data of a `Pandas` object with an external function or a mapping.
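As a sketch of transforming data with an external function, the example below applies a hypothetical helper `ounces_to_grams` to one column and then builds a text label from each row. The helper, the small frame and the conversion factor of 28.35 g per ounce are assumptions made for illustration.

```python
import pandas as pd

def ounces_to_grams(ounces):
    """Convert a weight in ounces to grams (1 oz is about 28.35 g)."""
    return ounces * 28.35

df = pd.DataFrame({'food': ['bacon', 'pastrami'], 'ounces': [4, 7.5]})

# Column-wise: apply the function to every value in one column.
df['grams'] = df['ounces'].apply(ounces_to_grams)

# Row-wise: axis=1 passes each row to the function as a Series.
df['label'] = df.apply(lambda row: f"{row['food']} ({row['grams']:.0f} g)", axis=1)
print(df)
```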

In [None]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
         'Pastrami', 'corned beef', 'Bacon',
         'pastrami', 'honey ham', 'nova lox'],
'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

In [None]:
meat_to_animal = {
      'bacon': 'pig',
      'pulled pork': 'pig',
      'pastrami': 'cow',
      'corned beef': 'cow',
      'honey ham': 'pig',
      'nova lox': 'salmon'
}
lowercased = data['food'].str.lower()
data['animal'] = lowercased.map(meat_to_animal)
data

In [None]:
# The same mapping in one step; map returns a new Series and leaves `data` unchanged
data['food'].map(lambda x: meat_to_animal[x.lower()])

##### Reading the `CSV` format into `Pandas` dataframe

Let's read a World Bank dataset as an example dataset.

In [None]:
data = pd.read_csv("Wealth-AccountsCountry.csv")

In [None]:
data.head()

In [None]:
data.columns.values

In [None]:
data['Latest population census']

In [None]:
data['Latest population census'].dropna()

In [None]:
pd.unique(data['Country Code'])

In [None]:
data_income_group = data.groupby(['Income Group','Region'])

In [None]:
data_income_group.groups

In [None]:
data_income_group.get_group(('Lower middle income',
  'Europe & Central Asia'))['Currency Unit']
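The grouping steps above can also be tried without the CSV file. The sketch below uses a small made-up frame (`toy`, with a hypothetical `value` column) that only mimics the two grouping columns; it shows group sizes, a per-group mean and `get_group` with a tuple key.

```python
import pandas as pd

# Hypothetical data with the same two grouping columns as the World Bank file.
toy = pd.DataFrame({
    'Income Group': ['High income', 'High income', 'Low income', 'Low income'],
    'Region': ['Europe', 'Asia', 'Asia', 'Asia'],
    'value': [10, 20, 30, 50],
})

grouped = toy.groupby(['Income Group', 'Region'])

sizes = grouped.size()               # number of rows in each group
means = grouped['value'].mean()      # mean of one column per group
europe_high = grouped.get_group(('High income', 'Europe'))

print(sizes)
print(means)
print(europe_high)
```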

### Data Visualisation ~ 20 mins

##### Introduction to Matplotlib and Seaborn

Matplotlib information : <https://matplotlib.org>
Seaborn information : <https://seaborn.pydata.org>

**Some more visualisation tools that will not be covered in this session** : 

ggplot2 : <https://ggplot2.tidyverse.org>  \
Grammar of graphics : <https://www.amazon.com/Grammar-Graphics-Statistics-Computing/dp/0387245448>

Interactive data visualisation : 
Bokeh : <https://bokeh.pydata.org/en/latest/>


In [None]:
import matplotlib as mpl
import matplotlib.pyplot as plt
plt.style.use('classic')

In [None]:
#dir(plt)

In [None]:
%matplotlib inline

In [None]:
import numpy as np
x = np.linspace(0, 10, 200)

fig = plt.figure()
plt.plot(x, np.sin(x), '-')
plt.plot(x, np.cos(x), '--');

`Seaborn` style is much better aesthetically

In [None]:
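# Matplotlib 3.6 renamed the bundled seaborn styles; on older versions use 'seaborn-whitegrid'
plt.style.use('seaborn-v0_8-whitegrid')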

In [None]:
plt.plot(x, np.sin(x), '-')
plt.plot(x, np.cos(x), '--');

##### Simple line plots

In [None]:
x = np.linspace(0, 10, 200)
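# Matplotlib 3.6 renamed the bundled seaborn styles; on older versions use 'seaborn-whitegrid'
plt.style.use('seaborn-v0_8-whitegrid')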
plt.plot(x, np.sin(x), '-')
plt.plot(x, np.cos(x), '--');

In [None]:
plt.plot(x, np.sin(x), color='blue',linestyle=':')
plt.plot(x, np.cos(x), color='red',linestyle='-.')
plt.plot(x, x, color='brown',linestyle='--')
plt.plot(x, np.sqrt(x), color='orange',linestyle='-')

In [None]:
plt.plot(x, np.sin(x), color='blue',linestyle='dotted')
plt.plot(x, np.cos(x), color='red',linestyle='dashdot')
plt.plot(x, x, color='brown',linestyle='dashed')
plt.plot(x, np.sqrt(x), color='orange',linestyle='solid')

In [None]:
plt.plot(x, np.sin(x),':b')
plt.plot(x, np.cos(x), '-.r')
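plt.plot(x, x, '--c')  # dashed cyan line
# In a format string 'o' means circle markers, not orange, so the colour is passed separately
plt.plot(x, np.sqrt(x), '-', color='orange')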

In [None]:
plt.plot(x, np.sin(x), color='blue',linestyle='dotted')
plt.plot(x, np.cos(x), color='red',linestyle='dashdot')
plt.plot(x, x, color='brown',linestyle='dashed')
plt.plot(x, np.sqrt(x), color='orange',linestyle='solid')

### Title of the plot
plt.title('Different plots for different settings')
plt.xlabel("x axis")
plt.ylabel('y axis')


In [None]:
plt.plot(x, np.sin(x), color='blue',linestyle='dotted',label='sin(x)')
plt.plot(x, np.cos(x), color='red',linestyle='dashdot',label='cos(x)')
plt.plot(x, x, color='brown',linestyle='dashed',label='x')
plt.plot(x, np.sqrt(x), color='orange',linestyle='solid',label='sqrt(x)')

### Title of the plot
plt.title('Different plots for different settings')
plt.xlabel("x axis")
plt.ylabel('y axis')

### Legend
plt.legend()

Simple Scatter plots

In [None]:
x = np.linspace(0, 10, 20)
plt.plot(x, np.sin(x),'o', color='blue',linestyle='dotted',label='sin(x)')
plt.plot(x, np.cos(x),'o', color='red',linestyle='dashdot',label='cos(x)')
plt.plot(x, x,'o',color='brown',linestyle='dashed',label='x')
plt.plot(x, np.sqrt(x),'o', color='orange',linestyle='solid',label='sqrt(x)')

### Title of the plot
plt.title('Different plots for different settings')
plt.xlabel("x axis")
plt.ylabel('y axis')

### Legend
plt.legend()

In [None]:
x = np.linspace(0, 10, 20)
plt.plot(x, np.sin(x),'o', color='blue',linestyle='dotted',label='sin(x)')
plt.plot(x, np.cos(x),'+', color='red',linestyle='dashdot',label='cos(x)')
plt.plot(x, x,'<',color='brown',linestyle='dashed',label='x')
plt.plot(x, np.sqrt(x),'*', color='orange',linestyle='solid',label='sqrt(x)')

### Title of the plot
plt.title('Different plots for different settings')
plt.xlabel("x axis")
plt.ylabel('y axis')

### Legend
plt.legend()

In [None]:
x = np.linspace(0, 10, 20)
plt.scatter(x, np.sin(x), color='blue',linestyle='dotted',label='sin(x)')
plt.plot(x, np.cos(x), color='red',linestyle='dashdot',label='cos(x)')
plt.plot(x, x,color='brown',linestyle='dashed',label='x')
plt.plot(x, np.sqrt(x), color='orange',linestyle='solid',label='sqrt(x)')

### Title of the plot
plt.title('Different plots for different settings')
plt.xlabel("x axis")
plt.ylabel('y axis')

### Legend
plt.legend()

Density and Contour plots

In [None]:
def f(x, y):
    return np.cos(x) ** 15 + np.sin(10 + y * x) * np.cos(x) ** 12

In [None]:
x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)

X, Y = np.meshgrid(x, y)
Z = f(X, Y)

With a single colour, solid lines represent positive values and dashed lines represent negative values

In [None]:
plt.contour(X, Y, Z, colors='brown');

Gradients with the color map 

In [None]:
plt.contour(X, Y, Z, 20,cmap='RdGy');

In [None]:
plt.contour(X, Y, Z, 50,cmap='RdGy');
plt.colorbar()

Text and annotations

In [None]:
%matplotlib inline

fig, ax = plt.subplots()

x = np.linspace(0, 20, 1000)
ax.plot(x, np.cos(x))
ax.axis('equal')

ax.annotate('local maximum', xy=(6.28, 1), xytext=(10, 4),
            arrowprops=dict(facecolor='black', shrink=0.05))

ax.annotate('local minimum', xy=(5 * np.pi, -1), xytext=(2, -6),
            arrowprops=dict(arrowstyle="->",
                            connectionstyle="angle3,angleA=0,angleB=-90"));

Three dimensional plotting 

In [None]:
from mpl_toolkits import mplot3d

In [None]:
ax = plt.axes(projection='3d')

In [None]:
def f(x, y):
    return np.sin(np.sqrt(x ** 2 + y ** 2))

x = np.linspace(-6, 6, 30)
y = np.linspace(-6, 6, 30)

X, Y = np.meshgrid(x, y)
Z = f(X, Y)

fig = plt.figure()
ax = plt.axes(projection='3d')
ax.contour3D(X, Y, Z, 50, cmap='RdGy')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z');

##### Visualisation of Pandas dataframes with Seaborn

In [None]:
import matplotlib.pyplot as plt
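# Matplotlib 3.6 renamed the bundled seaborn styles; on older versions use 'seaborn-whitegrid'
plt.style.use('seaborn-v0_8-whitegrid')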
import numpy as np
import pandas as pd
import seaborn as sns

In [None]:
data = pd.read_csv("Wealth-AccountsCountry.csv")
data.head()

In [None]:
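# distplot is deprecated since seaborn 0.11; histplot with kde=True is the closest replacement
ax = sns.histplot(data['Latest water withdrawal data'].dropna(), color='blue', kde=True, stat='density')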

In [None]:
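# distplot is deprecated since seaborn 0.11; passing the data as y draws the
# histogram horizontally, which replaces distplot's vertical=True
ax = sns.histplot(y=data['Latest water withdrawal data'].dropna(), color='blue', kde=True, stat='density')
ax.set_ylabel("Latest water withdrawal data")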


In [None]:
data_income_group = data.groupby(['Income Group','Region'])
data_income_group.get_group(('Lower middle income',
  'Europe & Central Asia'))['Latest water withdrawal data']

In [None]:
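# distplot is deprecated since seaborn 0.11; histplot with kde=True is the closest replacement
ax = sns.histplot(data_income_group.get_group(('Lower middle income',
  'Europe & Central Asia'))['Latest water withdrawal data'].dropna(), color='blue', kde=True, stat='density')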

In [None]:
fig, ax = plt.subplots(5,2,figsize = (20,30))
plt.subplots_adjust(wspace=0.2,hspace=0.4)

ax[0,0].set_title("")
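# distplot is deprecated since seaborn 0.11; histplot with kde=True is the closest replacement
sns.histplot(data_income_group.get_group(('Lower middle income',
  'Europe & Central Asia'))['Latest water withdrawal data'].dropna(), kde=True, stat='density', ax = ax[0,0])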
