# Big Data Analytics in Python

## Dr. Sagar Jilka
### BRC PhD Teaching, 5th Feb 2019


After completing this course, you should be able to:
1.	Demonstrate familiarity with the Python language and the Jupyter Notebook environment

2.	Understand and demonstrate familiarity with working with raw data, e.g.
a.	Merging datasets
b.	Working with missing data
c.	Aggregating data
d.	Basic plotting

3.	Summarize and analyse data in Python, based on statistics already familiar to you, e.g.:
a.	Obtaining descriptive statistics (mean, median, SD, etc)
b.	T-test (+sig., df)
c.	Correlation (+sig., r)
d.	ANOVA

4.	Explore data through familiar and advanced visualisation techniques, including:
a.	Previously introduced histograms
b.	New ways of data visualisation and graph manipulation

5.	Develop a deeper understanding of industry-level data science, including the basics of machine learning and building machine learning models.


# What is Python?

Python is a powerful programming language created by Guido van Rossum. 


![title](..\Sagar_Guido.jpg)

It has a simple, easy-to-use syntax.

Let's take a quick look at the “Zen of Python”.

Type import this in the cell below and run it:

In [None]:
import this

### Downloading Python 

You can download it from www.python.org

However, it is best to download a scientific Python distribution such as Anaconda: https://www.anaconda.com/download

Choose Python 3. Python 2.7 and Python 3 have a few subtle differences in syntax; Python 3 has more help available online, so it is easier to use, and Python 2 will also soon stop being supported... :(


### What is an IDE?

IDE = Integrated Development Environment

A program for writing/running scripts

Makes it easy to write and run code in one program

‘Jupyter’ is a good IDE that comes with the Anaconda installation, as you can see ;)

# (1) Basic Python 

#### Variables

A variable stores some sort of data

The value of a variable is assigned using a single =


In [None]:
my_variable = "Hello folks"

In [None]:
print(my_variable)

In [None]:
my_variable = 1
myVar = 2

In [None]:
my_variable + myVar

##### Heads up - naming!

If you want to give a variable a name containing multiple words, there are two accepted methods:

camelCase = using an uppercase letter to start each new word

using_underscores = separating the words with underscores

This is a personal preference, although the official Python style guide (PEP 8) suggests using underscores


#### Data Types

Basic data types include strings, integers, floats and Booleans

String = text (Indicated by wrapping the text in quotes (‘’, or “”) E.g. “hello world”)

Integer = whole numbers (E.g. 5)

Float = numbers with decimals (E.g. 2.351)

Boolean = True or False (Use the keywords True or False, written with a capital first letter)

In [None]:
myVar2 = "Twenty"
myVar3 = 20
myVar4 = 20.1
myVar5 = True

In [None]:
print(myVar4)
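If you are ever unsure which type a value has, the built-in type() function will tell you. The short sketch below uses made-up values (they are not part of the course data) and also shows how to convert between types:

```python
# Hypothetical values, chosen only for illustration
examples = ["Twenty", 20, 20.1, True]

for value in examples:
    # type() returns the class of the object; __name__ is its short name
    print(repr(value), "is a", type(value).__name__)

# Converting between types
as_int = int("20")      # string -> integer
as_float = float(20)    # integer -> float
as_text = str(20.1)     # float -> string
print(as_int, as_float, as_text)
```

Note that int("Twenty") would raise a ValueError, because Python only converts text that looks like a number.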

#### Maths

Python can perform mathematical operations on integers or floats

Add +

Subtract -

Divide / 

Multiply *

Modulo % (gives remainder of a division)
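Here is a quick sketch of these operators in action. The numbers are invented for illustration, and two operators not listed above are included as extras: // (floor division) and ** (exponent).

```python
bars = 1010      # hypothetical number of chocolate bars
per_box = 20

print(bars + per_box)    # add: 1030
print(bars - per_box)    # subtract: 990
print(bars * per_box)    # multiply: 20200
print(bars / per_box)    # divide, always gives a float: 50.5
print(bars // per_box)   # floor division, whole boxes only: 50
print(bars % per_box)    # modulo, bars left over: 10
print(2 ** 3)            # exponent: 8

# divmod() returns the floor division and the remainder in one go
full_boxes, left_over = divmod(bars, per_box)
```

Floor division and modulo are a handy pair whenever you need to count "how many complete groups, and how many left over".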

## Quick tasks 1 & 2

(1) Create 4 variables representing a string, an integer, a float, and a Boolean

(2) A chocolate factory packs 20 bars into each box. Today they produced 1000 bars of chocolate. Write an algorithm to calculate and output the number of full boxes produced in a day.


(Reminder: string = text, integer = whole number, float = decimals, boolean = True/False)


In [None]:
#2 in simple form
numberOfBars = 1000
numberOfBoxes = numberOfBars // 20   # // is floor division, so only full boxes are counted
print(numberOfBoxes)

In [None]:
#2 in slightly more complicated form.

numberOfBars = int(input("Enter number of bars produced : "))
numberOfBoxes = numberOfBars // 20
print("\n**", numberOfBoxes , "'full' boxes have been produced.**")


#### Lists, dictionaries, tuples

Lists are what they seem - a list of values. Each one of them is numbered, starting from zero. You can remove values from the list, and add new values to the end.

Tuples are just like lists, but you can't change their values. The values that you give it first up, are the values that you are stuck with for the rest of the program. Again, each value is numbered starting from zero. 

Dictionaries are similar to what their name suggests - a dictionary. In a dictionary, you have an 'index' of words, and for each of them a definition. In Python, the word is called a 'key', and the definition a 'value'. The values in a dictionary aren't numbered - you look them up by key rather than by position (since Python 3.7 a dictionary does remember the order in which its keys were added). You can add, remove, and modify the values in dictionaries.

In [None]:
myList = [1, 2, 3, 4, 5, "Captain", "America"]

In [None]:
print(myList)
myList[5]

In [None]:
myList.append("Avengers")

In [None]:
myList

In [None]:
myList2 = [975, 2454, 'Hello']

In [None]:
myTuple = ('January','February','March','April','May','June','July','August','September','October','November','December')

In [None]:
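# Uncomment the line below to see the error: tuples cannot be changed, so they have no append method
#myTuple.append('sadf')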

In [None]:
myTuple[0]
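To see what "you can't change their values" means in practice, here is a minimal sketch using a short, made-up tuple. Both attempts to change it fail, and the only way to get a "changed" tuple is to build a new one:

```python
months = ("January", "February", "March")   # illustrative tuple

try:
    months.append("April")       # tuples have no append method
except AttributeError as err:
    print("Cannot append:", err)

try:
    months[0] = "Jan"            # item assignment is not allowed either
except TypeError as err:
    print("Cannot modify:", err)

# To "change" a tuple, build a new one from the old one
longer = months + ("April",)
print(longer)
```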

In [None]:
#Make the phone book:
phonebook = {'Andrew Parson': 8806336,
             'Emily Everett': 6784346,
             'Peter Power': 7658344,
             'Lewis Lame': 1122345}

In [None]:
phonebook

In [None]:
phonebook['Gingerbread Man'] = 1234567

In [None]:
phonebook

In [None]:
## Good website: http://sthurlow.com/python/lesson06/

#### Loops

In [None]:
for i in myList:
    print(i)
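The for loop works on more than lists. The sketch below (with invented example data) shows three common patterns: looping over a dictionary, looping a fixed number of times, and looping with a counter.

```python
phonebook_demo = {"Andrew Parson": 8806336, "Emily Everett": 6784346}

# Loop over a dictionary: .items() gives (key, value) pairs
for name, number in phonebook_demo.items():
    print(name, "->", number)

# Loop a fixed number of times: range(3) gives 0, 1, 2
squares = []
for n in range(3):
    squares.append(n ** 2)
print(squares)

# Loop with a counter: enumerate() pairs each item with its position
labelled = []
for position, hero in enumerate(["Captain", "America"]):
    labelled.append((position, hero))
print(labelled)
```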

# (2) Working with data 

Python has many powerful packages to help us analyse data. The most commonly used are Numpy and Pandas.

Pandas stands for “Python Data Analysis Library”. It's great for dealing with spreadsheet-style data – by creating ‘Dataframes’. You can then easily select columns from datasets and apply functions to them.

#### Pandas 

Now we will do some actual analysis with real life data using Pandas. The plan is:

1) Read in a csv file for analysis

2) Understand pandas and dataframes

3) Clean and pre-process data (e.g. fill missing values)

4) Compute basic stats

5) Merge datasets

6) Aggregate data

In [None]:
# Your Anaconda Python distribution comes with Pandas ready to use. All you have to do to use it is ‘import’ pandas. 
# Common abbreviation to use is ‘pd’ as below. 

import pandas as pd

In [None]:
# you can see your current path by typing !cd (on Windows) or %pwd (on any operating system)
!cd

In [None]:
# explore the pandas read_ functions. You can import all sorts of files, e.g. excel, html, sql etc...
# explore the function arguments by hitting tab inside the green parentheses.
# the function below will import the data based on your arguments and create a dataframe assigned to the variable called df

df = pd.read_csv("brain_size_clindata.csv")

In [None]:
df.head()

In [None]:
# you can view your new dataframe by typing the variable name
#df

# you can view the first five rows by typing .head() after the variable name
#df.head()

# you can view the last five rows by typing .tail() after the variable name
#df.tail()

# Slicing: you can select a particular portion of the dataframe (i.e. slicing) by telling python which rows you want to slice inside []
#df[22:25]

# Experiment with slicing e.g. df[:30] vs. df[30::2]

df.head(3)

In [None]:
# Now we need to import the ppt information dataset

df2 = pd.read_csv("brain_size_ppts.csv")

In [None]:
df2.head()

In [None]:
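# you can select only the columns you need, which removes the stray 'Unnamed: 0' column from df2
# (left commented out here because the merge below uses 'Unnamed: 0' as one of its keys)
#df2 = df2[['ppt_id' , 'Gender' , 'Weight' , 'Height']]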

In [None]:
df2.head()

In [None]:
df.head()

#### Merging Datasets

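A merge combines two DataFrames row by row using one or more shared key columns, much like a join in SQL (have a look at the SQL join images). The how argument decides which rows are kept: 'left' keeps every row of the first DataFrame, 'inner' keeps only rows whose key appears in both.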
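To make the different kinds of merge concrete, here is a small sketch on two invented tables (the column names mimic the course data, but the values are made up). The indicator=True option is an extra that adds a column showing where each row came from.

```python
import pandas as pd

# Two toy tables sharing the key column ppt_id (values are hypothetical)
clin = pd.DataFrame({"ppt_id": [1, 2, 3], "FSIQ": [133, 140, 139]})
ppts = pd.DataFrame({"ppt_id": [1, 2, 4], "Gender": ["Female", "Male", "Male"]})

# left: keep every row of clin; ppt 3 has no match, so its Gender is missing
left = clin.merge(ppts, on="ppt_id", how="left")

# inner: keep only the ppt_ids found in both tables
inner = clin.merge(ppts, on="ppt_id", how="inner")

# outer: keep everything; the _merge column says which table each row came from
outer = clin.merge(ppts, on="ppt_id", how="outer", indicator=True)

print(left, inner, outer, sep="\n\n")
```

Checking the number of rows before and after a merge is a cheap way to catch unexpected duplicates or lost participants.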

In [None]:
# create one merged DataFrame
df_all = df.merge(df2, on = ['ppt_id', 'Unnamed: 0'], how = 'left')

In [None]:
# Great, looks like it has worked (no errors!)
# How can we take a look? What do we need to type?

In [None]:
# Let's keep only the columns we need. You can select columns by using the syntax below

df_all = df_all[['ppt_id', 'FSIQ', 'VIQ', 'PIQ', 'MRI_Count', 'Gender', 'Weight', 'Height', 'Diagnosed' ]]

In [None]:
df_all.head()

In [None]:
# If you had column names that were different, you could rename them:

# df.rename(columns={'pptId': 'ppt_id'}, inplace=True)

# Note the use of a dictionary there.

#### Working with missing data 

The fillna method replaces missing values, and there are various ways to choose what to fill them with...

You could also just use df_all.dropna(), but this would remove important data.

How else could you impute missing cell values?


In [None]:
# Consider

# fillna(0) fillna("MissingValue") fillna(-99) etc
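The options above can be compared side by side on a tiny invented table. This is only a sketch; which strategy is appropriate depends on your data and your analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing weights
toy = pd.DataFrame({"Gender": ["Female", "Female", "Male", "Male"],
                    "Weight": [118.0, np.nan, 172.0, np.nan]})

# Option 1: fill with a constant (easy, but 0 is not a plausible weight)
constant = toy["Weight"].fillna(0)

# Option 2: fill with the overall column mean, (118 + 172) / 2 = 145
overall = toy["Weight"].fillna(toy["Weight"].mean())

# Option 3: drop every row that has a missing value (loses half the rows here)
dropped = toy.dropna()

print(constant.tolist())
print(overall.tolist())
print(len(dropped), "rows left after dropna")
```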

In [None]:
df_all.describe()

In [None]:
df_all.isnull()

In [None]:
# Sum up the output above to show the total number of missing values per column

df_all.isnull().sum()

In [None]:
# Shows the rows where there are missing values

df_all[df_all.isnull().any(axis=1)]

In [None]:
# numeric_only=True skips text columns such as Gender
df_all.mean(numeric_only=True)

In [None]:
# You could run this code below, which would fill the missing Weight values with the whole
# column means. Is this a good idea? Why, why not?

#df_all["Weight"].fillna(df_all["Weight"].mean())

In [None]:
df_all.groupby("Gender").mean(numeric_only=True)

In [None]:
df_all["Weight"] = df_all["Weight"].fillna(166.444444)

In [None]:
df_all["Height"] = df_all["Height"].fillna(413.052632)

In [None]:
df_all.head()

In [None]:
# Now that we have filled the missing values, you will see that there are no rows with
# missing values

df_all[df_all.isnull().any(axis=1)]

# Can anyone guess what the axis argument is doing?
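As a hint for the axis question, here is a minimal sketch on an invented table: axis=0 collapses the rows and gives one answer per column, while axis=1 collapses the columns and gives one answer per row.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing Weight and one missing Height
toy = pd.DataFrame({"Weight": [118.0, np.nan, 172.0],
                    "Height": [64.5, 72.5, np.nan],
                    "FSIQ": [133, 140, 139]})

missing = toy.isnull()

# axis=0: look down each column -> does this COLUMN contain any missing value?
per_column = missing.any(axis=0)

# axis=1: look across each row -> does this ROW contain any missing value?
per_row = missing.any(axis=1)

print(per_column)
print(per_row)

# Using the row-wise result as a filter keeps only the incomplete rows
print(toy[per_row])
```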

In [None]:
# Now consider if you had a HUGE dataset with multiple columns and multiple groups (i.e.
# more than just male and female). It would take time to fill each group by hand, so
# instead we fill each missing value with the mean of that row's own group

df_all["Weight"] = df_all["Weight"].fillna(df_all.groupby("Gender")["Weight"].transform("mean"))

In [None]:
df_all[df_all.isnull().any(axis=1)]

In [None]:
# And now let's fill missing Height values

df_all["Height"] = df_all["Height"].fillna(df_all.groupby("Gender")["Height"].transform("mean"))

In [None]:
# And voila... no more missing values!

df_all[df_all.isnull().any(axis=1)]
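The key to the group-wise fill is transform("mean"): unlike a plain groupby mean, which returns one row per group, it returns one value per original row, so it lines up with the column being filled. A small sketch on invented data:

```python
import numpy as np
import pandas as pd

# Hypothetical weights with one missing value in each group
toy = pd.DataFrame({"Gender": ["Female", "Female", "Female", "Male", "Male", "Male"],
                    "Weight": [110.0, 130.0, np.nan, 170.0, np.nan, 190.0]})

# One mean per ROW: every Female row gets 120, every Male row gets 180
group_means = toy.groupby("Gender")["Weight"].transform("mean")
print(group_means.tolist())

# fillna only uses the group mean where the original value is missing
toy["Weight"] = toy["Weight"].fillna(group_means)
print(toy)
```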

#### Aggregating data 

One of my favourite pandas functions is the groupby() function.

In [None]:
# Experiment with .median, .describe, consider adding other categorical variables in the
# groupby and then aggregating. 

# The .to_clipboard() method and the transpose (.T) are useful
# when writing a paper.


df_all.groupby("Gender").mean(numeric_only=True)#.T

In [None]:
# What if you want to investigate two categorical variables?

cols_to_grpby = ['Gender', 'Diagnosed']

df_all.groupby(cols_to_grpby).mean(numeric_only=True)

In [None]:
df_all.groupby(cols_to_grpby).median(numeric_only=True)

In [None]:
df_all.groupby(cols_to_grpby).describe().T

In [None]:
# Prepare your tables for publication? 
# I copy to clipboard and then format in Excel..

df_all.groupby(cols_to_grpby).describe().T.to_clipboard()
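If describe() gives you more than you need, the agg method lets you pick exactly which statistics go into the table. The sketch below uses an invented table; the output column names (n, mean_fsiq, sd_fsiq) are just labels chosen for this example.

```python
import pandas as pd

# Hypothetical scores for two groups
toy = pd.DataFrame({"Gender": ["Female", "Female", "Male", "Male"],
                    "FSIQ": [130, 140, 100, 110]})

# Named aggregation: new_column=(source_column, statistic)
summary = toy.groupby("Gender").agg(
    n=("FSIQ", "size"),
    mean_fsiq=("FSIQ", "mean"),
    sd_fsiq=("FSIQ", "std"),
)

# Round for presentation before copying the table elsewhere
print(summary.round(2))
```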

## Data Visualisation

Start with basic bar/line plots, learning seaborn methods, and then something interactive

In [None]:
#we need to import more libraries, notably matplotlib and seaborn

import matplotlib.pyplot as plt
from IPython.display import display
import seaborn as sns

#note the magic function below. This is important because it will allow the plots you
#create to appear here in the notebook.
%matplotlib inline


In [None]:
xx = df_all["FSIQ"] # you can replace FSIQ with any other column.. try it
plt.hist(xx, bins = 10);

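Now let's consider the seaborn library, which we will use because it is generally easier and makes the images look much better!

In [None]:
# histplot draws the histogram and rugplot adds a small tick for every observation
sns.histplot(data=df_all, x='Weight', bins=10)
sns.rugplot(data=df_all, x='Weight')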

We want to investigate:


(1) Differences in IQ/brain size between men and women?

(2) Any correlations?

**Now is a good time to explore the seaborn library. I will show you how to use the examples and apply them to your work.**

In [None]:
#https://seaborn.pydata.org/generated/seaborn.catplot.html?highlight=catplot#seaborn.catplot

sns.catplot(data=df_all,
            x="Gender", 
            y="PIQ", 
            #palette={"Male": "blue", "Female": "pink"}, # you can also specify colours!
            kind="strip"); #experiment with different kinds, e.g. "bar", "strip", "swarm", "box", "violin", or "boxen".



In [None]:
# Let's add another categorical variable to the graph:
# We want to add Diagnosed or not to the figure

sns.catplot(data=df_all,
            x="Gender", 
            y="PIQ", 
            hue = "Diagnosed",
            #palette={"Male": "blue", "Female": "pink"}, # you can also specify colours!
            kind="bar"); #experiment with different kinds, e.g. "bar", "strip", "swarm", "box", "violin", or "boxen".



In [None]:
# Let's do some correlations and make a correlation matrix
# numeric_only=True skips text columns such as Gender

corr = df_all.corr(numeric_only=True)

In [None]:
corr

In [None]:
# Now let's add some colour to help differentiate between variables

corr.style.background_gradient()#.format(precision=2)

If you want the p-value for each Pearson correlation, you can use the function below. Don't worry about the code, I will show you how to implement it

In [None]:
from scipy.stats import pearsonr
import pandas as pd

def calculate_pvalues(df):
    # Keep complete rows and numeric columns only
    df = df.dropna().select_dtypes(include="number")
    # Empty square table with one row and one column per variable
    pvalues = pd.DataFrame(index=df.columns, columns=df.columns, dtype=float)
    for r in df.columns:
        for c in df.columns:
            # pearsonr returns (r, p-value); we keep the p-value
            pvalues.loc[c, r] = round(pearsonr(df[r], df[c])[1], 4)
    return pvalues

In [None]:
calculate_pvalues(df_all)
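For a single pair of variables you can call pearsonr directly, which gives you both the correlation coefficient (r) and its p-value. The numbers below are invented purely to show the call:

```python
from scipy.stats import pearsonr

# Hypothetical paired measurements
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# pearsonr returns the correlation coefficient and the two-sided p-value
r, p = pearsonr(x, y)
print("r = %.3f, p = %.3f" % (r, p))
```

With the course data you would pass two columns instead, for example df_all["FSIQ"] and df_all["PIQ"].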

In [None]:
#Exploring scatterplots

#https://seaborn.pydata.org/generated/seaborn.scatterplot.html

In [None]:
sns.scatterplot(x="FSIQ", y="PIQ", data=df_all)

In [None]:
#plt.figure(figsize=(10,8))

sns.scatterplot(data=df_all,
                x="FSIQ", 
                y="PIQ", 
                hue="Gender",) #experiment with different categorical 'hues', e.g. Diagnosed
                #size = "Height") # Explore what the size argument does...


In [None]:
# Saving your figures for publication

# assign the plot to a variable (something like g = )

g = sns.scatterplot(data=df_all,
                x="FSIQ", 
                y="PIQ", 
                hue="Gender",) #experiment with different categorical 'hues', e.g. Diagnosed
                #size = "Height") # Explore what the size argument does...

    
# Use the savefig method and provide a filename, such as figure1.png
# You can use keyword arguments based on your journal's submission requirements
# For example, for the British Journal of Psychiatry (https://www.cambridge.org/core/services/authors/journals/journals-artwork-guide)
# they want images at 300dpi and ideally in TIFF format, so you can save your figure appropriately
# So check your submission guidelines and provide the appropriate arguments
# Tip: for posters, you might want to use the transparent argument

#g.figure.savefig("output.tiff", dpi = 300) #transparent = True)

# Statistics in Python

##### IV = Gender (Male, Female), also Diagnosed (Yes, No)
##### DV = IQ Scores (FSIQ, VIQ, PIQ)

We want to test whether there is a difference in IQ measures between genders, or between the "diagnosed" groups. What tests can we use?

In [None]:
# There are two libraries (that I know of anyway) to do stats in Python - scipy and statsmodels

# Let's start with scipy and we will compare the syntax and output with statsmodels

# Import as below...

import scipy.stats as stats

### Assumption of normality

#### Shapiro-Wilk test (output = W test statistic, p value)

In [None]:
stats.shapiro(df_all["FSIQ"])

# consider the difference between the above and the below commented code:

#stats.shapiro(df_all["FSIQ"][df_all['Gender'] == 'Male'])

In [None]:
# You can also test other DVs

print(stats.shapiro(df_all["PIQ"]))
print(stats.shapiro(df_all["VIQ"]))

In [None]:
# Now let's make some Q-Q Plots
stats.probplot(df_all["FSIQ"],plot= plt)

# Give your figure a title
plt.title("FSIQ Q-Q Plot");

In [None]:
# Let's make a loop and test all three of our IQ DVs in one go:

# First make a list variable called cols with your DVs in it
cols = ['FSIQ', 'PIQ', 'VIQ']

# Then make a loop using the for statement. This will loop through every item in cols
# and do what you tell it to do; in this case, we are telling it to do stats.shapiro
for i in cols:
    print(i)
    print(stats.shapiro(df_all[i][df_all['Gender'] == 'Male']))
    
for i in cols:
    plt.figure()  # start a new figure so each DV gets its own Q-Q plot
    stats.probplot(df_all[i][df_all['Gender'] == 'Male'], plot= plt)
    plt.title(i + " Q-Q Plot (Male)")
    
print("\n\nAssumption of normality is violated if the p-values are less than 0.05.")

##### Levene's Test 

In [None]:
levene_1 = df_all["PIQ"][df_all['Gender'] == 'Male']
levene_2 = df_all["PIQ"][df_all['Gender'] == 'Female']

In [None]:
stats.levene(levene_1, levene_2)

In [None]:
# Let's make a quick loop so we can run a Levene's test on all our DVs

for i in cols:
    print(i , ':' , stats.levene(df_all[i][df_all['Gender'] == 'Male'], 
                                 df_all[i][df_all['Gender'] == 'Female']))
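Once you have checked the assumptions, the simplest test for a difference between two groups is an independent-samples t-test. Here is a minimal sketch with invented scores; it assumes equal variances (i.e. a non-significant Levene's test), and the degrees of freedom are worked out by hand as n1 + n2 - 2.

```python
from scipy import stats

# Hypothetical scores for two groups of five participants
male = [100, 105, 110, 115, 120]
female = [110, 115, 120, 125, 130]

# equal_var=True is the classic Student's t-test;
# use equal_var=False (Welch's t-test) if Levene's test is significant
t_stat, p_value = stats.ttest_ind(male, female, equal_var=True)

# Degrees of freedom for the equal-variance test
df_t = len(male) + len(female) - 2

print("t(%d) = %.3f, p = %.3f" % (df_t, t_stat, p_value))
```

With the course data you would pass the two slices used for Levene's test above, e.g. levene_1 and levene_2.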

##### ANOVA 

In [None]:
# We will use statsmodels because the output is better (more readable)

import statsmodels.api as sm
from statsmodels.formula.api import ols

# Documentation here: https://www.statsmodels.org/stable/index.html

In [None]:
results = ols("FSIQ ~ C(Gender)", data = df_all).fit()

In [None]:
results.summary()
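For comparison, scipy can also run a one-way ANOVA, but it only returns the F statistic and the p-value rather than a full summary table. A small sketch with invented scores:

```python
from scipy import stats

# Hypothetical scores for two groups
group_a = [100, 105, 110, 115, 120]
group_b = [110, 115, 120, 125, 130]

# f_oneway takes one sequence per group and returns (F, p)
f_value, p_value = stats.f_oneway(group_a, group_b)
print("F = %.3f, p = %.3f" % (f_value, p_value))
```

With only two groups, the F value is simply the square of the t statistic from an equal-variance t-test on the same data.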

In [None]:
print("One-way ANOVA: F(%f, %f) = %f , p = %f" %(results.df_model, 
                                                 results.df_resid,
                                                 results.fvalue, 
                                                 results.f_pvalue))

In [None]:
# Consider writing a function that writes out your results section for you!
# You need to find the following bits of info from the above...

# "F(df effect, df error) = F-value, MSE = mean-square error, p-value".
# e.g., "IQ scores did/didn't differ significantly between genders, F(X,XX) = XXXX, MSE = XXX, p = XXX."

# You can find each of those individual bits of data by typing results. then hit tab and 
# you will be able to see all the methods that the results object contains!

# For instance:

#print(results.df_model)
#print(results.df_resid)
#print(results.fvalue)
#print(results.f_pvalue)
#print(results.mse_resid)

# Now you just need to put all that together...

In [None]:
def res_output(dv, iv, df):

    # So first build the model formula from the DV and IV, e.g. "VIQ ~ C(Gender)"
    formula = "%s ~ C(%s)" % (dv, iv)

    # Then make the model.
    results = ols(formula, data=df).fit()

    # Then choose the appropriate wording based on the p-value...
    if results.f_pvalue > 0.05:
        finding = "no significant difference"
    else:
        finding = "a significant difference"

    # ...and print out the statement
    print("A one-way ANOVA was conducted to compare difference in %s between %s. "
          "We found %s between %s, F(%.0f, %.0f) = %.3f, p = %.3f"
          % (dv, iv, finding, iv,
             results.df_model,
             results.df_resid,
             results.fvalue,
             results.f_pvalue))

    # If you fancy, you can tell the function to return an output such as the
    # summary table, if so, uncomment the bottom bit below...

    #return results.summary()

In [None]:
# Now you can call your function and give it your arguments like below...

res_output(dv = "VIQ", 
           iv = "Gender", 
           df = df_all) 

# Machine Learning - Would you survive the Titanic?!

To learn the concept of machine learning (ML), let's predict whether YOU would have survived the Titanic, based on the passenger and survival records.

Python has a powerful library called scikit-learn (sklearn) which provides ready-made implementations of many ML models!

In [None]:
#Load up the titanic3.xls into a pandas dataframe

titanic_df = pd.read_excel('titanic3.xls', sheet_name='titanic3', na_values=['NA'])

The columns refer to:

- survived - Survival (0 = No; 1 = Yes)
- pclass - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
- name - Name
- sex - Sex
- age - Age
- sibsp - Number of Siblings/Spouses Aboard
- parch - Number of Parents/Children Aboard
- ticket - Ticket Number
- fare - Passenger Fare
- cabin - Cabin
- embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
- boat - Lifeboat (if survived)
- body - Body number (if did not survive and body was recovered)
- home.dest - home destination

In [None]:
titanic_df.head()

Firstly, let's explore the dataset.

What was the survival rate?
Can you make a table of gender vs. class and survival?
Can you plot what you are interested in?

In [None]:
titanic_df['survived'].mean()*100

In [None]:
titanic_df.groupby(['sex', 'pclass'])['survived'].mean()*100

In [None]:
sns.catplot(data = titanic_df, x = 'sibsp', y = 'parch', kind = 'point', aspect = 2)

What features (aka columns) do you think will be important for us to predict survival?

In [None]:
# We can drop some columns..

titanic_df = titanic_df.drop(['body','cabin','boat', 'home.dest' , 'embarked'], axis=1)

In [None]:
# let's clean the data in prep for ML...
# ML models work only with numbers, can anyone see an issue with the dataset?

titanic_df['sex'] = titanic_df['sex'].map({'female': 1, 'male': 0})
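Mapping to 0 and 1 works well for a column with two categories. For a text column with three or more unordered categories (such as the port of embarkation), a common alternative is one indicator column per category. The sketch below uses a tiny invented table, not the Titanic file itself:

```python
import pandas as pd

# Hypothetical mini-table with a two-level and a three-level text column
demo = pd.DataFrame({"sex": ["female", "male", "female"],
                     "embarked": ["S", "C", "Q"]})

# Two categories: a single 0/1 column is enough
demo["sex"] = demo["sex"].map({"female": 1, "male": 0})

# Three categories with no natural order: one indicator column per category
encoded = pd.get_dummies(demo, columns=["embarked"])
print(encoded)
```

Coding the ports as 1, 2, 3 instead would wrongly tell the model that one port is "bigger" than another.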


In [None]:
titanic_df.head(2)

In [None]:
# ML also doesn't work with missing values, so we can impute them...
# for this demonstration, I'm just going to get rid of empty data

titanic_df.dropna(inplace=True)

In [None]:
#let's see if we have any missing rows...

titanic_df[titanic_df.isnull().any(axis=1)]

Now let's create our features and response variables (X and y) 

In [None]:
# store the feature matrix (X) and response vector (y)

X = titanic_df[['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare']]

y = titanic_df['survived']

In [None]:
# check the shapes of X and y to make sure the number of rows match up
# why is this important?

print(X.shape)
print(y.shape)

In [None]:
# lets look at our features.. can anyone see what's wrong here?
X.head()

In [None]:
# lets look at y
y

### 4 steps to ML in Python... IIFP

#### (1) Import
#### (2) Instantiate
#### (3) Fit
#### (4) Predict
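Before applying the four steps to the Titanic data, here they are in one place on a tiny invented dataset (one feature, two well-separated classes), just to show the pattern:

```python
# (1) Import the model class
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data: small numbers are class 0, large numbers are class 1
X_toy = [[1], [2], [3], [10], [11], [12]]
y_toy = [0, 0, 0, 1, 1, 1]

# (2) Instantiate the model
model = KNeighborsClassifier(n_neighbors=3)

# (3) Fit the model to the data
model.fit(X_toy, y_toy)

# (4) Predict the class of new observations
prediction = model.predict([[2.5], [10.5]])
print(prediction)
```

Almost every scikit-learn model follows this same import, instantiate, fit, predict pattern, which is why swapping one model for another takes only a couple of lines.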

In [None]:
# (1) import the class, in this case we will use k-neighbours

from sklearn.neighbors import KNeighborsClassifier

In [None]:
# (2) instantiate the model (here with 25 neighbours rather than the default of 5)

knn = KNeighborsClassifier(n_neighbors=25)#weights = 'uniform')

In [None]:
# (3) fit the model with data (occurs in-place)

knn.fit(X, y)
#knn.score(X, y)

In [None]:
X.columns

In [None]:
# (4) Predict
# Enter your data in the same order of the X columns:
# There should be 6 numbers:

knn.predict([[2, 0, 32, 0, 1, 32]])

We can see that the model has predicted that the new observation’s class is 0 (i.e. I wouldn't have survived!).

We can even look at the probabilities the model assigned to each class to see how confident it is:

In [None]:
knn.predict_proba([[2, 0, 32, 0, 1, 32]])

According to this result, the model gave the observation a ~52% probability of having died and a ~48% probability of having survived.

Because dying had the greater probability, the model predicted that class for the observation.
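Where do these probabilities come from? With the default uniform weights, k-nearest neighbours simply reports the share of the k closest training points that belong to each class. A minimal sketch on invented data with k = 3:

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data with a single feature
X_toy = [[1], [2], [3], [4], [5]]
y_toy = [0, 0, 1, 1, 1]

knn_demo = KNeighborsClassifier(n_neighbors=3)
knn_demo.fit(X_toy, y_toy)

# The three points closest to 2.4 are 2, 3 and 1, with classes 0, 1 and 0
proba = knn_demo.predict_proba([[2.4]])
label = knn_demo.predict([[2.4]])

print(proba)   # two of three neighbours are class 0, one is class 1
print(label)   # the class with the larger share wins
```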

Let's try another model (Random Forest) but you can experiment with many more...

- Random Forest
- Perceptron
- SGD classifier
- Decision tree


In [None]:
# First (1) import
from sklearn.ensemble import RandomForestClassifier
#from sklearn.linear_model import Perceptron
#from sklearn.linear_model import SGDClassifier
#from sklearn.tree import DecisionTreeClassifier

In [None]:
# Then (2) instantiate
random_forest = RandomForestClassifier(n_estimators=100)

In [None]:
# Then (3) fit

random_forest.fit(X, y);

In [None]:
# Then (4) predict

random_forest.predict([[2, 0, 32, 0, 1, 35.23]])

In [None]:
random_forest.predict_proba([[2, 0, 32, 0, 1, 35.23]])

Let's see which of our features were the most important to this model:

In [None]:
# Make a dataframe
importances = pd.DataFrame({'feature':X.columns,
                            'importance': random_forest.feature_importances_})

# Sort the value, set the index 
importances = importances.sort_values('importance',ascending=False).set_index('feature')

# Display the first 15 rows of the dataframe
importances.head(15)

In [None]:
0.296 + 0.286 + 0.258 + 0.083 + 0.041 +0.036

In [None]:
importances.importance.sum()

In [None]:
importances.plot.bar()

In [None]:
# Note: this is the accuracy on the same data the model was trained on, so it will look optimistic
random_forest.score(X, y)
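A score calculated on the training data tells you how well the model memorised those rows, not how well it will do on new passengers. A common remedy is to hold back part of the data for testing. The sketch below uses a synthetic dataset generated by scikit-learn (an assumption made so the example runs on its own); with the Titanic data you would pass X and y instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 300 rows, 6 features, 2 classes
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# Hold back 25% of the rows; the model never sees them during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Compare accuracy on seen and unseen data
train_acc = forest.score(X_train, y_train)
test_acc = forest.score(X_test, y_test)
print("Training accuracy:", train_acc)
print("Test accuracy:", test_acc)
```

The test accuracy is the more honest estimate of how the model would perform on people it has never seen.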