Collaborative coding using GitHub
===========

Alexandre Perera Luna, Mónica Rojas Martínez

December 15th 2023


# Goal

The objective of this assignment is to construct a project through collaborative coding, showcasing an Exploratory Data Analysis (EDA) and a classification. To facilitate your understanding of GitHub, we will utilize code snippets from previous exercises, allowing you to focus on the process without concerns about the final outcome. The current notebook will serve as the main function in the project, and each participant is required to develop additional components and integrate their contributions into the main branch.


## Requirements

To use functions defined in other Jupyter notebooks, install the `nbimporter` package from a shell with the following command:

<font color='grey'>pip install nbimporter</font> 

`nbimporter` allows you to import Jupyter notebooks as modules. Once installed and imported, you can use a command like the following to import a function called *fibbonaci* that is stored in a notebook named *fibbo_func* in the same directory as the present notebook:

<font color='green'>from</font> fibbo_func <font color='green'>import</font> fibbonaci  <font color='green'>as</font> fibbo



In [1]:
## Modify this cell to import all the modules you need to solve the assignment. Note that we import
## the library nbimporter. You will need it to call functions created in other notebooks.
# Comment out the line below if nbimporter is already installed
!pip install nbimporter

import nbimporter
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns


%matplotlib inline



In [12]:
# Here is an example of invoking the Fibonacci function, which should be located in the same directory as the main notebook:
from fibbo_func import fibbonaci as fibbo
fibbo(24)

46368
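
The contents of the *fibbo_func* notebook are not shown here. As a point of reference only, a function consistent with the output above could look like the following sketch. It assumes the convention F(0) = 0 and F(1) = 1, which may differ from the actual implementation.

```python
def fibbonaci(n):
    """Return the n-th Fibonacci number, assuming F(0) = 0 and F(1) = 1."""
    a, b = 0, 1
    for _ in range(n):
        # Advance one step along the sequence
        a, b = b, a + b
    return a


print(fibbonaci(24))  # 46368
```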

## Exercises
As an illustration of the Git workflow, you will analyze the *Parkinson's* dataset, which was examined in past assignments. Each team member has specific responsibilities that may be crucial for the progress of others, so make sure you all organize your tasks accordingly. We have structured the analysis into modules to help you track your tasks, but feel free to deviate from this structure if you prefer.  
Please use Markdown cells to describe your workflow and explain your findings.  
Remember that you need both to modify this notebook and to create additional functions outside it. Your work will only be available to others once your changes are merged.


In [13]:
# We start by loading the Parkinson's dataset. The rest is up to you!
df = pd.read_csv('parkinsons.data', 
                 dtype = { # indicate categorical variables
                     'status': 'category'})
df.head(5)

| | name | MDVP:Fo(Hz) | MDVP:Fhi(Hz) | MDVP:Flo(Hz) | MDVP:Jitter(%) | MDVP:Jitter(Abs) | MDVP:RAP | MDVP:PPQ | Jitter:DDP | MDVP:Shimmer | ... | Shimmer:DDA | NHR | HNR | status | RPDE | DFA | spread1 | spread2 | D2 | PPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | phon_R01_S01_1 | 119.992 | 157.302 | 74.997 | 0.00784 | 7e-05 | 0.0037 | 0.00554 | 0.01109 | 0.04374 | ... | 0.06545 | 0.02211 | 21.033 | 1 | 0.414783 | 0.815285 | -4.813031 | 0.266482 | 2.301442 | 0.284654 |
| 1 | phon_R01_S01_2 | 122.4 | 148.65 | 113.819 | 0.00968 | 8e-05 | 0.00465 | 0.00696 | 0.01394 | 0.06134 | ... | 0.09403 | 0.01929 | 19.085 | 1 | 0.458359 | 0.819521 | -4.075192 | 0.33559 | 2.486855 | 0.368674 |
| 2 | phon_R01_S01_3 | 116.682 | 131.111 | 111.555 | 0.0105 | 9e-05 | 0.00544 | 0.00781 | 0.01633 | 0.05233 | ... | 0.0827 | 0.01309 | 20.651 | 1 | 0.429895 | 0.825288 | -4.443179 | 0.311173 | 2.342259 | 0.332634 |
| 3 | phon_R01_S01_4 | 116.676 | 137.871 | 111.366 | 0.00997 | 9e-05 | 0.00502 | 0.00698 | 0.01505 | 0.05492 | ... | 0.08771 | 0.01353 | 20.644 | 1 | 0.434969 | 0.819235 | -4.117501 | 0.334147 | 2.405554 | 0.368975 |
| 4 | phon_R01_S01_5 | 116.014 | 141.781 | 110.655 | 0.01284 | 0.00011 | 0.00655 | 0.00908 | 0.01966 | 0.06425 | ... | 0.1047 | 0.01767 | 19.649 | 1 | 0.417356 | 0.823484 | -3.747787 | 0.234513 | 2.33218 | 0.410335 |


### 1. Cleaning and tidying the dataset

To clean and tidy the dataset, we rename the variables and remove rows with missing values as well as duplicated rows.

In [4]:
from fibbo_func import renamevars

dict_names = {'MDVP:Fo(Hz)':'avFF',
              'MDVP:Fhi(Hz)':'maxFF',
              'MDVP:Flo(Hz)':'minFF',
              'MDVP:Jitter(%)': 'percJitter',
              'MDVP:Jitter(Abs)':'absJitter' ,
              'MDVP:RAP': 'rap',
              'MDVP:PPQ': 'ppq',
              'Jitter:DDP': 'ddp',
              'MDVP:Shimmer' : 'lShimer',
              'MDVP:Shimmer(dB)': 'dbShimer',
              'Shimmer:APQ3':'apq3',
              'Shimmer:APQ5': 'apq5',
              'MDVP:APQ':'apq',
              'Shimmer:DDA':'dda'}
df = renamevars(df, dict_names)

# Drop rows with missing values, then drop exact duplicate rows
df = df.dropna()
df = df.drop_duplicates()
df.head()
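
The implementation of `renamevars` lives in another notebook and is not shown. A minimal sketch of what such a helper might do, assuming it simply wraps `DataFrame.rename`, is given below on a small made-up table (the values are illustrative, not taken from the dataset).

```python
import pandas as pd


def renamevars(df, names):
    """Hypothetical helper: return a copy of df with columns renamed per `names`."""
    return df.rename(columns=names)


# Toy data: rows 0 and 1 are duplicates, row 2 has a missing value
toy = pd.DataFrame({
    'MDVP:Fo(Hz)': [119.992, 119.992, None],
    'MDVP:Fhi(Hz)': [157.302, 157.302, 148.650],
})

toy = renamevars(toy, {'MDVP:Fo(Hz)': 'avFF', 'MDVP:Fhi(Hz)': 'maxFF'})
clean = toy.dropna().drop_duplicates()

print(list(clean.columns))  # ['avFF', 'maxFF']
print(len(clean))           # 1
```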


### 2. Basic EDA based on plots and descriptive statistics

Before creating the plots, we import the `scat_plt` function.

In [None]:
from fibbo_func import scat_plt

To get a better overview of the correlation between variables using scatter plots, we have implemented a modified version of the `scat_plt` function called `scat_plt_2`, which arranges the different scatter plots in a grid.

In [None]:
from fibbo_func import scat_plt_2

Now that the required functions are imported, we can create the scatter plots. We start with the **Fundamental Frequency** (FF) variables. For k variables there are k · (k-1)/2 distinct pairs, so with k = 3 FF variables we need (3 · 2)/2 = 3 plots.
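
The pair count can be checked with `itertools.combinations`, which yields the pairs in the same order as a nested loop over `i < j`. This is only an illustrative sketch; the variable names are the renamed FF columns from the cleaning step.

```python
from itertools import combinations

variables = ['avFF', 'maxFF', 'minFF']

# All unordered pairs of distinct variables
pairs = list(combinations(variables, 2))
k = len(variables)

print(pairs)       # [('avFF', 'maxFF'), ('avFF', 'minFF'), ('maxFF', 'minFF')]
print(len(pairs))  # 3, which equals k * (k - 1) // 2
```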

In [None]:
ff = df.columns[1:4]

# Create a 1x3 grid for the subplots
fig, axs = plt.subplots(1, 3, figsize=(15,5))


# Loop through all pairwise combinations and create subplots
for (i, j), ax in zip([(i, j) for i in range(len(ff)) for j in range(i + 1, len(ff))], axs.flatten()):
    scat_plt_2(df[ff[i]], df[ff[j]], df.status, ax)
    ax.set_title(f'{ff[i]} vs {ff[j]}')  # Set subplot title

# Hide any remaining empty subplots
for ax in axs.flatten()[len(ff) * (len(ff) - 1) // 2:]:
    ax.axis('off')

plt.tight_layout()
plt.show()

We can see from the plots that all the FF variables are highly correlated. We can do the same for the 4 **Jitter** variables, where 6 scatter plots result from the pairwise comparison.

In [None]:
jitter = df.columns[4:8]

# Create a 2x3 grid for the subplots
fig, axs = plt.subplots(2, 3, figsize=(15, 10))


# Loop through all pairwise combinations and create subplots
for (i, j), ax in zip([(i, j) for i in range(len(jitter)) for j in range(i + 1, len(jitter))], axs.flatten()):
    scat_plt_2(df[jitter[i]], df[jitter[j]], df.status, ax)
    ax.set_title(f'{jitter[i]} vs {jitter[j]}')  # Set subplot title

# Hide any remaining empty subplots
for ax in axs.flatten()[len(jitter) * (len(jitter) - 1) // 2:]:
    ax.axis('off')

plt.tight_layout()
plt.show()

Finally, we use the function to analyse the correlation between all the **Shimmer** variables. In this case we need (6 · 5)/2 = 15 plots.

In [None]:
shimmer = df.columns[9:15] # Selection of Shimmer variables

# Create a 3x5 grid for the subplots
fig, axs = plt.subplots(3, 5, figsize=(15, 10))

# Loop through all pairwise combinations and create subplots
for (i, j), ax in zip([(i, j) for i in range(len(shimmer)) for j in range(i + 1, len(shimmer))], axs.flatten()):
    scat_plt_2(df[shimmer[i]], df[shimmer[j]], df.status, ax)
    ax.set_title(f'{shimmer[i]} vs {shimmer[j]}')  # Set subplot title

# Hide any remaining empty subplots
for ax in axs.flatten()[len(shimmer) * (len(shimmer) - 1) // 2:]:
    ax.axis('off')

plt.tight_layout()
plt.show()

To check whether variables are correlated, it often helps to plot a correlation matrix. Below, correlation heatmaps are shown for the Jitter and Shimmer variables, together with the variable that is most correlated with the rest (the one with the highest sum of absolute correlation coefficients).

In [None]:
correlation_matrix = df[jitter].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Matrix of Jitter Variables')
plt.show()

# Identify the variable most correlated with the rest by summing the absolute correlation values
most_correlated_variable = correlation_matrix.abs().sum().idxmax()
print('\nThe most correlated variable turns out to be', most_correlated_variable)

In [None]:
correlation_matrix = df[shimmer].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Matrix of Shimmer Variables')
plt.show()

# Identify the variable most correlated with the rest by summing the absolute correlation values
most_correlated_variable = correlation_matrix.abs().sum().idxmax()
print('\nThe most correlated variable turns out to be', most_correlated_variable)
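
Note that each column sum above includes the diagonal entry, i.e. the correlation of a variable with itself. Since that entry equals 1 for every variable, it shifts all sums equally and does not change which variable ranks highest. The sketch below illustrates this on synthetic data; the columns `a`, `b` and `c` are made up for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# 'a' and 'b' are strongly correlated; 'c' is independent noise
toy = pd.DataFrame({
    'a': base,
    'b': base + rng.normal(scale=0.1, size=200),
    'c': rng.normal(size=200),
})

corr = toy.corr().abs()
with_diag = corr.sum()           # includes the self-correlation of 1
without_diag = corr.sum() - 1.0  # self-correlation removed

# Both rankings select the same variable
print(with_diag.idxmax() == without_diag.idxmax())
```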

### 3. Aggregating and transforming variables in the dataset

To aggregate and transform the data, we need the auxiliary functions `group_and_average` and `normalize`.

In [None]:
from fibbo_func import group_and_average
from fibbo_func import normalize

Now we apply the `normalize` function to the aggregated data, once with the Z-score option and once with the min-max option. This yields two different normalized dataframes.

In [None]:
norm_z_df = normalize(av_df, 0)       # Normalising the aggregated data using the Z-score
norm_minmax_df = normalize(av_df, 1)  # Normalising the aggregated data using the min-max method
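
The body of `normalize` is defined in another notebook and is not shown. The following is a hypothetical sketch of how such a function could behave, assuming that option `0` means Z-score, option `1` means min-max scaling to [0, 1], and that the identifier and label columns (`id` and `stat`) are left untouched. The `skip` parameter and the toy values are assumptions made for illustration.

```python
import pandas as pd


def normalize(df, method, skip=('id', 'stat')):
    """Hypothetical sketch: scale every column not listed in `skip`.

    method 0 -> Z-score, method 1 -> min-max scaling to [0, 1].
    """
    out = df.copy()
    cols = [c for c in out.columns if c not in skip]
    if method == 0:
        # Subtract the column mean and divide by the sample standard deviation
        out[cols] = (out[cols] - out[cols].mean()) / out[cols].std()
    elif method == 1:
        # Map the column minimum to 0 and the column maximum to 1
        span = out[cols].max() - out[cols].min()
        out[cols] = (out[cols] - out[cols].min()) / span
    else:
        raise ValueError('method must be 0 (Z-score) or 1 (min-max)')
    return out


toy = pd.DataFrame({'id': ['S01', 'S02', 'S04'],
                    'stat': [1, 1, 0],
                    'avFF': [100.0, 150.0, 200.0]})

z = normalize(toy, 0)
mm = normalize(toy, 1)
print(z['avFF'].tolist())
print(mm['avFF'].tolist())  # [0.0, 0.5, 1.0]
```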

### 4. Differentiating between controls (healthy subjects) and patients

First, we import the k-nearest neighbors classifier from scikit-learn.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

1) The classifier is first fitted on the **cleaned and aggregated** data.

In [None]:
y = av_df['stat']  # Labels in 'stat' are used to train the classifier
X = av_df.drop(['id', 'stat'], axis=1)  # Drop all variables that are not predictors

# Initialize the model with n_neighbors = 3
knn = KNeighborsClassifier(n_neighbors=3)
# Fit the model on the observed data
knn.fit(X, y)
# Evaluate the model on the same data it was fitted on (training accuracy)
Acc = knn.score(X, y)
print('The accuracy of the model is ' + str(Acc))

2) Now we repeat the analysis after normalising the variables with the **Z-score**.

In [None]:
y = norm_z_df['stat']
X = norm_z_df.drop(['id', 'stat'], axis=1)  # Drop all variables that are not predictors

# Initialize the model with n_neighbors = 3
knn = KNeighborsClassifier(n_neighbors=3)
# Fit the model on the observed data
knn.fit(X, y)
# Evaluate the model on the same data it was fitted on (training accuracy)
Acc = knn.score(X, y)
print('The accuracy of the model is ' + str(Acc))

3) Finally, we use the cleaned and aggregated data normalized with the **min-max** option.

In [None]:
y = norm_minmax_df['stat']
X = norm_minmax_df.drop(['id', 'stat'], axis=1)  # Drop all variables that are not predictors

# Initialize the model with n_neighbors = 3
knn = KNeighborsClassifier(n_neighbors=3)
# Fit the model on the observed data
knn.fit(X, y)
# Evaluate the model on the same data it was fitted on (training accuracy)
Acc = knn.score(X, y)
print('The accuracy of the model is ' + str(Acc))

We do not observe any difference in accuracy between the three models, although normalization was expected to improve accuracy.
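
The accuracies above are computed on the same data used to fit each model, which can hide differences between the variants. One possible way to compare them on held-out data is k-fold cross-validation with the scaler placed inside a pipeline. The sketch below is not part of the original analysis: it uses synthetic data from `make_classification` as a stand-in for the aggregated features, and the choice of 5 folds is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in for the aggregated features (not the Parkinson's data)
X, y = make_classification(n_samples=120, n_features=8, random_state=0)

# Same classifier, three preprocessing variants
models = {
    'raw': KNeighborsClassifier(n_neighbors=3),
    'z-score': make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3)),
    'min-max': make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=3)),
}

results = {}
for name, model in models.items():
    # Each fold fits the scaler on the training part only
    scores = cross_val_score(model, X, y, cv=5)
    results[name] = scores.mean()
    print(f'{name}: mean CV accuracy = {scores.mean():.3f}')
```

Placing the scaler inside the pipeline means its parameters are estimated from the training folds only, so the held-out fold does not influence the normalization.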