# Introduction to Python (Part 3)

## Load and explore the dataset 

Modules in Python are similar to packages in R. To use a module, you first install it and then import it. The two easiest ways to install a module are **pip install** and **conda install**, each followed by the name of the module. If you use conda and are unsure about the exact command, I advise going to the https://anaconda.org/anaconda/ website and searching for the module you need. You will find all the information there!

To import a module into Python you have multiple options:

1) You can import the full module by writing: import [module name]

2) You can import only one function of the module by writing: from [module name] import [function name]

3) You can import a module under a shorter name: import numpy as np. Every time you need a function from the numpy package, you can then call it through np
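
The three import styles listed above can be sketched with modules from the Python standard library, so nothing needs to be installed first. The variable names (root_a, root_b, average) and the alias st are only chosen for this illustration.

```python
# 1) Import the whole module and reach its functions through the module name
import math
root_a = math.sqrt(16)

# 2) Import a single function and call it directly
from math import sqrt
root_b = sqrt(16)

# 3) Import a module under a shorter alias
import statistics as st
average = st.mean([1, 2, 3, 4])

print(root_a, root_b, average)
```

All three styles give access to the same functions; they differ only in how much you have to type and in how obvious it is where a function comes from.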

In [None]:
# import breast cancer dataset from scikit-learn
from sklearn.datasets import load_breast_cancer

In [None]:
# load dataset
breast_data = load_breast_cancer().data

Let's look at the shape of the dataset (how many columns and rows there are)

In [None]:
# check the shape of the data
breast_data.shape

In [None]:
# load the labels and check the shape
breast_labels = load_breast_cancer().target
breast_labels.shape

In [None]:
import numpy as np

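# reshape the labels into a single column and concatenate the data and labels together
# each row: data + label
# -1 lets numpy infer the number of rows (569 here) instead of hard-coding it
labels = np.reshape(breast_labels, (-1, 1))
final_breast_data = np.concatenate([breast_data, labels], axis=1)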

In this chunk you can see that the function reshape belongs to the numpy module. Since we imported numpy as np, we can call the function by writing np.reshape.

If you try to print final_breast_data, you will notice that it is hard to interpret, because all you see is a large array of unlabelled numbers. It would be much easier to work with a dataframe!
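
To see what reshape and concatenate do to the shapes, here is a small sketch on made-up arrays. The names toy_data, toy_labels, toy_labels_column and toy_combined are hypothetical stand-ins for the real data and labels.

```python
import numpy as np

# Toy stand-ins: 4 samples with 3 features each, plus one label per sample
toy_data = np.arange(12, dtype=float).reshape(4, 3)
toy_labels = np.array([0, 1, 1, 0])

# Turn the flat label vector into a column; -1 lets NumPy infer the row count
toy_labels_column = np.reshape(toy_labels, (-1, 1))

# axis=1 glues the arrays side by side, so the labels become an extra column
toy_combined = np.concatenate([toy_data, toy_labels_column], axis=1)

print(toy_labels.shape, toy_labels_column.shape, toy_combined.shape)
```

The labels have to be reshaped first because concatenate along axis=1 expects both arrays to be two-dimensional with the same number of rows.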


In [None]:
# Try to print the final_breast_data and see how it looks. Is it easy/difficult to understand?




Difficult, right? Let's create a dataframe.

In Python, dataframes are provided by a dedicated module called **pandas**. Pandas can be a bit trickier to use than simple data manipulation in R, but it is really powerful.


In [None]:
import pandas as pd

# create a dataframe from the numpy array
breast_dataset = pd.DataFrame(final_breast_data)

# Try to check how the dataframe looks now. Is it better and easier to understand? 



Can you see the problem with the current dataframe? The rows and columns are labelled only with numbers, so it is unclear what each value means. To understand the dataframe properly, we should add column names.

The names of the features are stored in the feature_names attribute of the dataset. Let's collect them in a variable called features and check them out.

In [None]:
# load feature names
features = load_breast_cancer().feature_names
features

Now that we have the feature names we just need to assign them to the right columns. In pandas there is an easy way to set column names. The last column holds the labels, so we append the name label after the feature names.

In [None]:
# add column names to the dataframe
breast_dataset.columns = np.append(features,'label')

Now that you have added the column names, try to print the dataframe again. This time, print only the head or the tail of the dataframe. This is useful when you have a lot of data and don't need to look at all of it. Printing a huge dataframe can be slow, so it is better to avoid it when it is not needed.

In [None]:
# let's have a look at the first five rows of the dataframe
breast_dataset.head(5)

# Try to print the first 10 rows of the data frame


# Now try to print the last 10 rows (Tip: use tail)



In [None]:
'''
For a dataframe, if you want to select specific rows and columns, use "dataFrame.loc[]" with names and "dataFrame.iloc[]" with integer positions.
Inside the square brackets you first put the rows and then the columns.
'''
# Select the value in the first row of the 'mean radius' column
breast_dataset.loc[0, 'mean radius']

17.99

In [None]:
'''
If you write "dataFrame.loc[:, 'mean radius']" you specify that you want to keep all the rows
and only the column named "mean radius".
'''
# Select the first two columns by name
breast_dataset.loc[:, ['mean radius', 'mean texture']]
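
The difference between loc (names) and iloc (integer positions) is easiest to see on a tiny dataframe. The sketch below uses made-up values and a hypothetical dataframe called toy_df; only the column names are borrowed from the real dataset.

```python
import pandas as pd

# Toy dataframe with made-up values
toy_df = pd.DataFrame({
    "mean radius": [1.0, 2.0, 3.0],
    "mean texture": [10.0, 20.0, 30.0],
    "label": [0, 1, 0],
})

# Same cell, selected once by name and once by position
by_name = toy_df.loc[0, "mean radius"]
by_position = toy_df.iloc[0, 0]

# First two columns by position
first_two_columns = toy_df.iloc[:, :2]

# Slices differ: loc includes the end label, iloc excludes the end position
rows_loc = toy_df.loc[0:1]
rows_iloc = toy_df.iloc[0:1]

print(by_name, by_position, len(rows_loc), len(rows_iloc))
```

With the default row index (0, 1, 2, ...) row names and row positions coincide, which is why loc[0, ...] works in the cell above; after filtering or sorting they can differ.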

Now that we have our dataframe with all the features, we just need to apply some kind of normalization. Features measured on very different scales can dominate methods such as PCA or nearest-neighbour classification, so it is usually a good idea to normalize the data before analysis or visualization.

A common way to normalize data is to subtract the mean of each feature and divide by its standard deviation. scikit-learn has a class that does this automatically for you, called StandardScaler.
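
To make the formula concrete, the sketch below standardizes a made-up array by hand and compares the result with StandardScaler. The array toy and the names manual and scaled are only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: 3 samples, 2 features on very different scales
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 60.0]])

# By hand: subtract each column's mean, divide by each column's standard deviation
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

# The same operation with scikit-learn
scaled = StandardScaler().fit_transform(toy)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

Note that the scaling is done per column (axis=0), so every feature ends up with mean 0 and standard deviation 1 on its own.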

In [None]:
from sklearn.preprocessing import StandardScaler

# get all feature values and normalise the data
x = breast_dataset.loc[:, features].values
x = StandardScaler().fit_transform(x)

# The reason why you want to keep only these columns is that these are the ones
# you need to normalize


In [None]:
# check the shape of the data
x.shape

Since we normalized each feature using its mean and standard deviation, we can check whether the normalization worked by verifying that the mean is now 0 and the standard deviation is 1.

In [None]:
# check whether the data is normalised
np.mean(x),np.std(x)

As you can see, the normalization was successful! Great job :)

## Visualisation

### Principal Component Analysis (PCA)

Principal component analysis is a dimensionality reduction method. It reduces the number of dimensions while trying to keep as much of the information as possible. The new components are linear combinations of the n original features, chosen so that each one captures as much of the remaining variance in the data as possible. PCA is commonly used to visualize data, by summarizing the n features into 2 principal components and plotting them in the (x, y) coordinate space. You can have as many principal components as there are original features, but it is common to keep only the first few, since they are the ones that explain most of the variance within the data.
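
As a rough sketch of what happens under the hood, the components can be computed with a singular value decomposition of the centered data. This is one standard way to compute PCA, shown on made-up data; it is not meant as a description of scikit-learn's exact implementation, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 samples, 3 features; the second feature closely follows the first
base = rng.normal(size=(100, 1))
toy = np.hstack([
    base,
    2 * base + 0.1 * rng.normal(size=(100, 1)),
    rng.normal(size=(100, 1)),
])

# Center each feature, then decompose
centered = toy - toy.mean(axis=0)
U, singular_values, Vt = np.linalg.svd(centered, full_matrices=False)

# Share of the total variance captured by each component (largest first)
explained_variance_ratio = singular_values**2 / np.sum(singular_values**2)

# Coordinates of the samples on the first two principal components
projected = centered @ Vt[:2].T

print(explained_variance_ratio)
print(projected.shape)
```

Because two of the three toy features carry almost the same information, the first component should capture most of the variance here.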

In [None]:
from sklearn.decomposition import PCA

# project the thirty-dimensional Breast Cancer data to two-dimensional space
pca_breast = PCA(n_components=2)
principalComponents_breast = pca_breast.fit_transform(x)

A nice way to inspect the coordinates of the samples on the newly obtained principal components is through a dataframe.

In [None]:
# create a dataframe of the projected data
principal_breast_Df = pd.DataFrame(data = principalComponents_breast, columns = ['principal component 1', 'principal component 2'])
principal_breast_Df.head()

In [None]:
print('Explained variation per principal component: {}'.format(pca_breast.explained_variance_ratio_))

Explained variation per principal component: [0.44272026 0.18971182]


As you can see, the first principal component explains about 44 percent of the variance, while the second explains only about 19 percent. Each further component explains less variance than the one before it. Keeping only the first two components therefore means that the information carried by the remaining components is lost.
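
Adding up the two ratios printed above tells you how much of the total variance the 2D projection keeps. The snippet simply copies the two numbers from the output; the variable names are only for illustration.

```python
# Explained variance ratios copied from the output above
ratios = [0.44272026, 0.18971182]

retained = sum(ratios)   # variance kept by the first two components
lost = 1 - retained      # variance carried by all remaining components

print(f"retained: {retained:.1%}, not captured: {lost:.1%}")
```

So the two components together keep a bit less than two thirds of the variance, and the rest is not visible in the 2D plot.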

In [None]:
# for loop through a list
fruits = ["apple", "banana", "cherry"]
for i in fruits:
    print(i)

In [None]:
# for loop through a string
for i in "fruits":
    print(i)

In [None]:
# for loop through a range
for i in range(5):
    print(i)
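
The plotting code further down loops over two lists at the same time using zip, which pairs up the elements of both lists position by position. A minimal sketch (the list pairs is only there to collect the result):

```python
targets = [0, 1]
colors = ["r", "g"]

pairs = []
# zip yields (0, "r") first and then (1, "g")
for target, color in zip(targets, colors):
    pairs.append((target, color))
    print(target, color)
```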

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

Matplotlib is a common module for creating plots in Python. You can describe it as the Python "ggplot". It is really straightforward to use. You start each command with plt. (the name the module is usually given when it is imported), followed directly by the function you need.

The first step is to define the size of the figure, the label names and sizes, and the title. Then you can plot the actual data.

In [None]:
# plot the 2D PCA projection
# Establish the dimensions of the figure you want to plot
plt.figure(figsize=(10,10))
# Specify the font of the x and y ticks
plt.xticks(fontsize=12)
plt.yticks(fontsize=14)
# Specify the x and y labels
plt.xlabel('Principal Component - 1',fontsize=20)
plt.ylabel('Principal Component - 2',fontsize=20)
# Give a title
plt.title("Principal Component Analysis of Breast Cancer Dataset",fontsize=20)
# Specify targets and colors (you want to color the points based on the condition)
targets = [0, 1]
colors = ['r', 'g']
for target, color in zip(targets,colors):
    # Boolean mask selecting the samples of the current class (0 = malignant, 1 = benign)
    indicesToKeep = breast_dataset['label'] == target
    # Scatter plot with PC1 on the x axis and PC2 on the y axis
    plt.scatter(principal_breast_Df.loc[indicesToKeep, 'principal component 1'], 
                principal_breast_Df.loc[indicesToKeep, 'principal component 2'], c = color, s = 50)
# Plot the legend (same order as targets: 0 = malignant, 1 = benign)
plt.legend(['Malignant', 'Benign'],prop={'size': 15})
# Show the plot
plt.show()

As you can see from the plot, the two principal components separate the benign and malignant samples well. This suggests two things: the two groups of samples are clearly distinct from each other, and the principal components summarize the n features of the original data well.

If you want to play a bit with the plot, you can try to change the colors, titles, sizes, labels, legends... Let us know if you don't understand something, we are here to help!
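
When reading the plot it helps to know which number stands for which class. In scikit-learn's copy of this dataset the class names are stored in target_names, in the order of the label codes. A small sketch to check this (the variable names are chosen for this illustration):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()

# Position in target_names corresponds to the label code (0, 1)
class_names = [str(name) for name in dataset.target_names]

# Number of samples per label code
class_counts = np.bincount(dataset.target)

for code, (name, count) in enumerate(zip(class_names, class_counts)):
    print(code, name, count)
```

Label 0 is malignant and label 1 is benign, so legend entries have to be listed in that order to match targets = [0, 1].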

### t-distributed Stochastic Neighbor Embedding (t-SNE)

Principal component analysis is a nice visualization method, but it is linear. That's why it sometimes cannot fully capture all the information contained in the data. t-SNE, on the other hand, is a non-linear visualization method that is also easy to use.

Disadvantage: it is stochastic, so the plot can change every time you run the code, unless you fix the random seed with the random_state parameter. Also, if you change the perplexity parameter, which roughly controls how many neighbours each point takes into account, the plot will look different. How do you choose the right perplexity? That's a good question, for which there is no good answer. I would say... try several values and see what you get!
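
A minimal sketch of trying several perplexity values, on a small made-up dataset so that it runs quickly. The data, the chosen perplexity values and the dictionary embeddings are assumptions for this illustration; random_state is set so that repeated runs should give comparable plots.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Toy data: two groups of 30 points each in 5 dimensions
toy = np.vstack([
    rng.normal(0, 1, size=(30, 5)),
    rng.normal(5, 1, size=(30, 5)),
])

embeddings = {}
# Perplexity has to be smaller than the number of samples
for perplexity in (5, 15):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embeddings[perplexity] = tsne.fit_transform(toy)
    print(perplexity, embeddings[perplexity].shape)
```

Each entry of embeddings can then be plotted in the same way as the t-SNE cell below, to compare how the picture changes with perplexity.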

In [None]:
from sklearn.manifold import TSNE

tsne_breast = TSNE(n_components=2, perplexity=30).fit_transform(x)

# plot the 2D t-SNE projection
plt.figure(figsize=(10,10))
plt.xticks(fontsize=12)
plt.yticks(fontsize=14)
plt.xlabel('t-SNE - 1',fontsize=20)
plt.ylabel('t-SNE - 2',fontsize=20)
plt.title("t-distributed Stochastic Neighbor Embedding of Breast Cancer Dataset",fontsize=20)
targets = [0, 1]
colors = ['r', 'g']
for target, color in zip(targets,colors):
    indicesToKeep = breast_dataset['label'] == target
    plt.scatter(tsne_breast[indicesToKeep, 0], tsne_breast[indicesToKeep, 1], c = color, s = 50)

# legend in the same order as targets: 0 = malignant, 1 = benign
plt.legend(['Malignant', 'Benign'],prop={'size': 15})
plt.show()

## Classification

Classification is commonly used to understand whether the data you have can distinguish between two or more classes. It is a type of supervised machine learning (together with regression). To judge how well a model works, you split your data into a training set and a test set. The training set is used to train the model, and the test set is used to evaluate the trained model on unseen data.

It is common in machine learning to do a 5-fold cross validation. This consists of dividing your data into 5 groups (folds). Each fold is used once as the test set, while the other 4 folds together form the training set. In the end, then, you would have 5 different trained models, each evaluated on a different part of the data.

This is commonly done to get a more reliable estimate of how well the model performs than a single, possibly lucky split would give, or to select hyperparameters.
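
To see how the folds are formed, the sketch below splits ten made-up samples into 5 folds and prints the indices. The array samples and the list test_folds are only for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples, identified by their index
samples = np.arange(10)

kf = KFold(n_splits=5)

test_folds = []
for fold, (train_index, test_index) in enumerate(kf.split(samples)):
    test_folds.append(test_index.tolist())
    print(fold, "train:", train_index, "test:", test_index)
```

Every sample ends up in the test set exactly once. Without shuffle=True the folds are consecutive blocks, so the order of the rows in the data matters.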

<img src="https://scikit-learn.org/stable/_images/grid_search_cross_validation.png" width="600">

In [None]:
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

kf = KFold(n_splits=5)

acc = 0

# get training set indices and testing set indices
for train_index, test_index in kf.split(x):
    
    X_train, X_test = x[train_index], x[test_index]
    # .iloc selects the labels by position, matching the positional indices from kf.split
    y_train, y_test = breast_dataset['label'].iloc[train_index], breast_dataset['label'].iloc[test_index]

    # fit a 5 nearest neighbour model with training set
    clf_knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    # predict the labels for testing set
    y_pred = clf_knn.predict(X_test)
    # calculate the accuracy for this model
    acc += accuracy_score(y_test, y_pred)/5

In [None]:
# 5-fold cross validation accuracy
acc
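
The loop above can also be written more compactly with scikit-learn's cross_val_score helper. The following is a self-contained sketch under the same settings (standardized features, 5 unshuffled folds, 5 nearest neighbours); the variable names are chosen for this illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Load features and labels, then standardize the features as before
X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# One accuracy value per fold
scores = cross_val_score(
    KNeighborsClassifier(n_neighbors=5),
    X_scaled,
    y,
    cv=KFold(n_splits=5),
    scoring="accuracy",
)
mean_accuracy = scores.mean()

print(scores)
print(mean_accuracy)
```

Looking at the individual fold scores, and not only at their mean, shows how much the result depends on the particular split.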

### References
1. https://www.datacamp.com/community/tutorials/principal-component-analysis-in-python
2. https://www.datacamp.com/community/tutorials/introduction-t-sne
3. https://scikit-learn.org/stable/_images/grid_search_cross_validation.png