# Motivation

I am currently learning some data science with Python. I am following this course on Udemy:

> [Python for Data Science and Machine Learning Bootcamp](https://www.udemy.com/course/python-for-data-science-and-machine-learning-bootcamp/)


In this notebook I want to practice some of the concepts and skills that I have learned so far:

- Some general exploratory data analysis (EDA)
- Principal Component Analysis
- Logistic Regression
- Grid search cross validation (CV)

## Some background

I am an amateur cyclist and occasional runner who uses *STRAVA* every time I get on my bike or put my trainers on. In case you don't know it, [*STRAVA*](https://www.strava.com/) (which is Swedish for "strive") is an app that lets you record and monitor your training sessions and then share them with other users, social-network style. *STRAVA* is mostly used by cyclists and runners.

Every time you record an activity, it keeps track of data like distance, moving time, elevation gain, average speed, heart rate, and so on. Thanks to their API, it is relatively easy to obtain your activity data, provided you are a registered user, of course. Explaining how to get all your activity data is out of the scope of this notebook (although I plan on uploading a notebook about it in the future). However, if you are curious about that, check the following links:

- [1 - Intro and accessing Strava API with Postman - Strava API for Beginners](https://www.youtube.com/watch?v=sgscChKfGyg)

- [3 - Using Strava API with Python - Strava API for Beginners](https://www.youtube.com/watch?v=2FPNb1XECGs)

- [Strava API v3 reference](https://developers.strava.com/docs/reference/)

## The task

I will provide a dataset of my activities, and the objective will be to train a logistic regression model
that works with only a few principal components and is able to predict whether an activity was of type 'Ride' or type 'Run'. The task may not be very challenging, but I think it is a good exercise for a beginner working in a field that interests me. Additionally, I want to perform a grid search CV to optimize the number of components and then use my model to make predictions on other people's Strava datasets.

# Start by importing the modules I will use

In [None]:
import pandas as pd
import seaborn as sb
from matplotlib import pyplot as plt
import numpy as np

# Load the data

The data of my activities is in the CSV file called ```activities_Miguel__Training-Validation.csv```.

In [None]:
activities = pd.read_csv('../input/strava-ride-or-run-classification/activities_Miguel__Training-Validation.csv',index_col=0)
print(activities.info())
activities.head()

## Meaning of the columns

I split some of the columns into two groups: performance_features and social_features.

#### Performance features

Distances and heights are in meters, and times are in seconds (the SI units).

- distance: distance of the activity.
- moving_time: time where you are actually doing something (coffee stops don't count).
- elapsed_time: total time STRAVA is registering.
- total_elevation_gain: positive amount of climbing.
- achievement_count: number of times you get a 3rd, 2nd or personal record in a segment.
- pr_count: number of personal records.
- max_speed: in meters/s. I will not use average_speed because it is just distance/moving_time.
- elev_high: maximum height reached.
- elev_low: minimum height reached.


#### Social features
- kudos_count: kudos are what Strava calls the likes given by friends.
- comment_count: number of comments.
- athlete_count: if this was a group activity then it is the number of companions in the activity. If it is a solo activity then it is 0.
- total_photo_count: Total number of photos uploaded in an activity.

Then there is the column that I will use for classification: **type**. **type** stands for the sport of the activity: Ride, Run, Walk, Hike, Surf, and so on. Since I am only interested in classifying Rides and Runs, I will keep just those two labels.



In [None]:
social_features = [
     'kudos_count',
     'comment_count',
     'athlete_count',
     'total_photo_count',
]

performance_features = [
     'distance',
     'moving_time',
     'elapsed_time',
     'max_speed',
     'total_elevation_gain',
     'elev_high',
     'elev_low',
     'achievement_count',
     'pr_count',
]

# Some exploratory data analysis of my Strava activities

I will keep only rides and runs. Since average_speed = distance / moving_time, I will also drop that column because it adds no new information.

In [None]:
activities.drop(
    activities[
        (activities['type']!='Run') &
        (activities['type']!='Ride')
    ].index,
    inplace = True
)

activities.drop('average_speed',axis=1,inplace=True)
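
As a side note, the same filtering can be sketched more compactly with `isin`. The toy frame below uses made-up values (it is not my real data), and it also checks the claim that `average_speed` is redundant, assuming it is computed as distance over moving time.

```python
import numpy as np
import pandas as pd

# Toy activities (made-up values, for illustration only)
toy = pd.DataFrame({
    'type': ['Ride', 'Run', 'Hike', 'Ride'],
    'distance': [40000.0, 10000.0, 8000.0, 25000.0],  # meters
    'moving_time': [6000, 3000, 7200, 4000],          # seconds
})
toy['average_speed'] = toy['distance'] / toy['moving_time']  # meters/second

# Keep only rides and runs
toy_filtered = toy[toy['type'].isin(['Ride', 'Run'])]

# average_speed can be rebuilt from the other two columns,
# so it carries no extra information
redundant = np.allclose(
    toy_filtered['average_speed'],
    toy_filtered['distance'] / toy_filtered['moving_time'],
)
print(toy_filtered['type'].tolist(), redundant)
```

Unlike `drop(..., inplace=True)`, this returns a new dataframe and leaves the original untouched.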

## Start with some pair plots of the data set
### Pair plot of performance data

In [None]:
performance_pair_plot = sb.pairplot(
    activities,
    vars = performance_features,
    hue = 'type',
    diag_kind = 'hist',
    palette = 'colorblind'
)

### Pair plot of social data

In [None]:
social_pair_plot = sb.pairplot(
    activities,
    vars = social_features,
    hue = 'type',
    diag_kind = 'hist',
    palette = 'colorblind'
)

# Principal Component Analysis

I will start by defining a data frame for the performance data

In [None]:
activities_performance = activities[performance_features]
activities_performance.head()

Then I will scale the performance dataframe to zero mean and unit standard deviation.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_performance = scaler.fit_transform(activities_performance)
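
To make explicit what the scaling step does, here is a minimal sketch on a made-up array (not my activities): `StandardScaler` subtracts the mean of each column and divides by its standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: two features on very different scales (made-up values)
X_toy = np.array([[1000.0, 1.0],
                  [2000.0, 2.0],
                  [3000.0, 6.0]])

scaled_toy = StandardScaler().fit_transform(X_toy)

# Same operation by hand: subtract the column mean and divide by the
# population standard deviation (ddof=0, which is NumPy's default)
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(scaled_toy, manual))
print(scaled_toy.mean(axis=0).round(6), scaled_toy.std(axis=0).round(6))
```

Without this step, features measured in large units (like distance in meters) would dominate the principal components.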

Fit a PCA decomposition and transform the data

In [None]:
from sklearn.decomposition import PCA

In [None]:
pca = PCA(n_components=9)
performance_pca = pca.fit_transform(scaled_performance)

Let's see what the data looks like after being transformed to principal components

In [None]:
plt.figure(figsize=(8,6));
sb.scatterplot(x = performance_pca[:,0],y = performance_pca[:,1],hue=activities['type']);
plt.xlabel('First principal component');
plt.ylabel('Second Principal Component');

The value of the first PC is a good indicator of the type of activity

### Plot of the explained variance ratio of each component

In [None]:
plt.figure(figsize=(12,6))
sb.barplot(
    y = pca.explained_variance_ratio_,
    x = [f'Component {n+1}' for n in range(pca.n_components) ]
)
plt.ylabel('Ratio of explained variance');

It looks like the first 3 components describe about 70% of the variance.
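
Instead of reading the number off the bar plot, the cumulative sum of the ratios can be checked directly. This is only a sketch on synthetic data (the arrays `latent` and `X_syn` and the 0.70 threshold are my assumptions for the illustration), but the same two lines would work with the fitted `pca` object.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic correlated data: 5 features driven by 2 hidden factors plus noise
latent = rng.normal(size=(200, 2))
X_syn = latent @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(200, 5))

pca_syn = PCA().fit(StandardScaler().fit_transform(X_syn))

# Running total of the variance explained by the first k components
cumulative = np.cumsum(pca_syn.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.70
n_needed = int(np.searchsorted(cumulative, threshold) + 1)
print(cumulative.round(3), n_needed)
```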

### Relation of the components to the original data

To interpret the principal components, it is useful to see how the different features contribute to each of them. This information is in the loadings of the principal components. I will make a dataframe with the loadings of the PCs

In [None]:
components_dataframe = pd.DataFrame(
    pca.components_,
    columns = performance_features,
    index = [f'Component {n+1}' for n in range(pca.n_components_) ]
)
components_dataframe
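
To see why the rows of `pca.components_` tell us how each feature contributes, here is a small sketch on random data (the array `X_syn` is made up): the value of each principal component is just the centered data multiplied by the corresponding row of weights.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X_syn = rng.normal(size=(100, 4))          # made-up data, 4 features
X_centered = X_syn - X_syn.mean(axis=0)    # PCA centers the data internally

pca_syn = PCA(n_components=2)
scores = pca_syn.fit_transform(X_syn)

# Each row of components_ holds the weights of one principal component,
# so the scores are a weighted sum of the centered features
manual_scores = X_centered @ pca_syn.components_.T
print(np.allclose(scores, manual_scores))

# The rows are unit vectors, so the weights are comparable across components
print(np.linalg.norm(pca_syn.components_, axis=1))
```

A positive weight means the component grows with that feature, and a negative weight means it shrinks.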

A heat map will help to see the contribution of each feature to the components. I make a heat map showing the absolute value of the loadings and a + or - sign indicating whether the component value increases or decreases with the corresponding feature.

In [None]:
ann = components_dataframe.applymap(
    lambda x:{-1:'-',1:'+'}.get(np.sign(x).astype('int'),'0')
)

plt.figure(figsize=(15,10))
sb.heatmap(
    components_dataframe.abs(),
    annot = ann,
    fmt = '',
    cmap = 'rocket'
);
ax = plt.gca()
ax.tick_params(axis='x', labelrotation=60)
for label in ax.xaxis.get_ticklabels():
    label.set_horizontalalignment('right')

Distance, moving time and elevation gain are of great importance for the first component, which explains almost 50% of the variance.

In [None]:
sb.scatterplot(y = activities_performance['distance'],x = performance_pca[:,0],hue=activities['type']);

# Logistic regression with principal components


I will train a logistic classifier that classifies the activities according to their type: Run and Ride. However,
I want a model that classifies the activities according to the values of some of the first principal components,
instead of using all the features as input directly. To do this, the data needs to be scaled, then transformed to
the principal components and finally fed into the logistic classifier.

The easiest way of doing this in sklearn is by using a pipeline, i.e., a concatenation of different transformers and an estimator at the end.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Define a pipeline to search for the best PCA truncation.
# The concatenated steps are: scaler -> PCA transform -> logistic classifier
pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('pca', PCA()),
    ('logistic', LogisticRegression(max_iter=10000, tol=0.1)),
])
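
To illustrate what the pipeline saves me from writing, here is a sketch on synthetic data from `make_classification` (all the `*_toy` and `*_syn` names are made up for this example). Fitting the pipeline should agree with chaining the three steps by hand.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_syn, y_syn = make_classification(n_samples=200, n_features=6, random_state=0)

# Pipeline version: one fit call runs every step in order
toy_pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=3)),
    ('logistic', LogisticRegression(max_iter=1000)),
])
toy_pipe.fit(X_syn, y_syn)

# Manual version: fit each step on the output of the previous one
scaler_toy = StandardScaler().fit(X_syn)
X_scaled = scaler_toy.transform(X_syn)
pca_toy = PCA(n_components=3).fit(X_scaled)
X_pcs = pca_toy.transform(X_scaled)
logistic_toy = LogisticRegression(max_iter=1000).fit(X_pcs, y_syn)

# Fraction of samples where both versions predict the same class
agreement = np.mean(toy_pipe.predict(X_syn) == logistic_toy.predict(X_pcs))
print(agreement)
```

The pipeline also guarantees that, at prediction time, new data goes through exactly the same scaling and projection that were learned from the training data.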

## Split the data into train and test sets

As usual, X will be the performance data and y the type of the activity.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    activities_performance,
#     pd.get_dummies(activities['type'])['Ride'],
    activities['type'],
    test_size=0.25,
    random_state=42
)

### Distribution of classes in the train and test sets

In [None]:
fig,[ax1,ax2] = plt.subplots(ncols=2,figsize = (10,5),sharey=True)
sb.countplot(x = y_train,ax = ax1)
ax1.set_title('Training')
sb.countplot(x = y_test,ax = ax2)
ax2.set_title('Test');

The two classes are not very well balanced. I need to run more often.
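
With unbalanced classes, one option I did not use above is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both sets. A minimal sketch with made-up labels (80 rides and 20 runs, not my real counts):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up unbalanced labels: 80 rides and 20 runs
y_toy = pd.Series(['Ride'] * 80 + ['Run'] * 20, name='type')
X_toy = pd.DataFrame({'distance': np.arange(100, dtype=float)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy,
    y_toy,
    test_size=0.25,
    random_state=42,
    stratify=y_toy,  # keep the Ride/Run proportions in both sets
)

print(y_tr.value_counts(normalize=True).round(2).to_dict())
print(y_te.value_counts(normalize=True).round(2).to_dict())
```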

## Fit the pipeline

In [None]:
pipe.fit( X_train, y_train );

## Make some predictions

In [None]:
y_predictions = pipe.predict(X_test)

## Evaluate the performance of the model

In [None]:
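from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix, classification_report
plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator  # plot_confusion_matrix was removed in scikit-learn 1.2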

### Plot of the confusion matrix

In [None]:
plt.figure(figsize=(6,6))
plot_confusion_matrix(pipe,X_test,y_test,ax=plt.gca());

### Classification report

In [None]:
print(
    classification_report(y_test,y_predictions)
)
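
To connect the confusion matrix with the numbers in the report, here is a sketch with made-up labels (not my test set) that computes precision, recall and F1 for the 'Run' class by hand, plus the support-weighted F1 average that appears in the report as "weighted avg".

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up true and predicted labels, for illustration only
y_true_toy = np.array(['Ride', 'Ride', 'Ride', 'Ride', 'Run', 'Run', 'Ride', 'Run'])
y_pred_toy = np.array(['Ride', 'Ride', 'Run', 'Ride', 'Run', 'Ride', 'Ride', 'Run'])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=['Ride', 'Run'])

# Treat 'Run' as the positive class
tp, fp, fn = cm[1, 1], cm[0, 1], cm[1, 0]
precision_run = tp / (tp + fp)   # of the predicted runs, how many were runs
recall_run = tp / (tp + fn)      # of the true runs, how many were found
f1_run = 2 * precision_run * recall_run / (precision_run + recall_run)

# Weighted F1: per-class F1 averaged with the class counts as weights
f1_ride = f1_score(y_true_toy, y_pred_toy, pos_label='Ride')
support = cm.sum(axis=1)
f1_weighted = (support[0] * f1_ride + support[1] * f1_run) / support.sum()

print(cm)
print(precision_run, recall_run, f1_run, f1_weighted)
```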

### Visualize the predictions for two selected features and the 2 first principal components

In [None]:
fig,[[ax1,ax2],[ax3,ax4]] = plt.subplots(ncols=2,nrows=2,figsize = (15,15),sharey='row',sharex='row')

# Scatter plots for distance and elevation gain 

#Scatter plot for true classes
sb.scatterplot(
    x = X_test['distance'],
    y = X_test['total_elevation_gain'],
    hue = y_test,
    ax = ax1
)
ax1.set_title('True classes');

#Scatter plot for predicted classes
sb.scatterplot(
    x = X_test['distance'],
    y = X_test['total_elevation_gain'],
    hue = y_predictions,
    ax = ax2
)
ax2.set_title('Predicted');

# Scatter plots for 2 first PC
X_test_transformed = pipe[:-1].transform(X_test)  # apply the pipeline up to the PCA step (everything but the classifier)

#Scatter plot for true classes
sb.scatterplot(
    x = X_test_transformed[:,0],
    y = X_test_transformed[:,1],
    hue = y_test,
    ax = ax3
)
ax3.set_title('True classes');
ax3.set_xlabel('First PC')
ax3.set_ylabel('Second PC')

#Scatter plot for predicted classes
sb.scatterplot(
    x = X_test_transformed[:,0],
    y = X_test_transformed[:,1],
    hue = y_predictions,
    ax = ax4
)
ax4.set_title('Predicted');
ax4.set_xlabel('First PC')

fig.suptitle('Compare predictions with true classes');

As can be seen, the model performs reasonably well. I will not stop there, though: let's see whether it can be improved by selecting the optimal number of components. I will do this with a grid search CV.

# Grid Search CV

In [None]:
from sklearn.model_selection import GridSearchCV

## Build the grid estimator.

A neat feature of pipelines is that they can be cross-validated for different hyperparameters in a really simple way. You just have to give the values of the parameters for the different steps of the pipeline using the following convention (note the double underscore between the step name and the parameter name):

```python
param_grid = {
    '<step_name>__<parameter_name>': <list of values>,
    ...
}
```

In [None]:
param_grid = {'pca__n_components':[1,2,3,4,5,6,7,8,9]}
grid = GridSearchCV(
    pipe,
    param_grid = param_grid,
    verbose = 3,
    scoring = 'f1_weighted'
)
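
If you are unsure which names are valid keys for `param_grid`, the pipeline can list them. A small sketch, using a stand-in pipeline called `toy_pipe` (a made-up name) with the same three steps:

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

toy_pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('pca', PCA()),
    ('logistic', LogisticRegression()),
])

# Nested parameters are addressed as <step_name>__<parameter_name>
valid_names = sorted(name for name in toy_pipe.get_params() if '__' in name)
print([name for name in valid_names if name.startswith('pca__')])

# set_params follows the same naming convention as param_grid
toy_pipe.set_params(pca__n_components=3, logistic__C=0.5)
print(toy_pipe.named_steps['pca'].n_components)
```

Any of the listed names can be used as a key in `param_grid`, so the regularization strength of the classifier could be searched at the same time as the number of components.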

## Fit the grid estimator

In [None]:
grid.fit(X_train,y_train);

## Results of the grid search

### Best hyperparameters (number of PCs)

In [None]:
print(
    '\n'.join(   f'{ind}: {val}' for ind,val in grid.best_params_.items() ) 
)

### Complete results in a dataframe

In [None]:
results = pd.DataFrame(grid.cv_results_)
results

### Plot of the score as a function of the number of components

In [None]:
fig, (ax0,ax1) = plt.subplots(nrows=2,sharex=True,figsize=(6,8))

# Plot of the explained variance ratio of every component
ax0.plot(
    np.arange(1, pipe['pca'].n_components_ + 1),
    pipe['pca'].explained_variance_ratio_,
    'k-o',
    linewidth=2
)
ax0.axvline(grid.best_params_['pca__n_components'],ls='--',c='k',label = 'Chosen number\n of components')

ax0.legend(prop=dict(size=12))
ax0.set_ylabel('PCA explained variance ratio')

# Plot of the score as a function of the number of components
results.plot(
    x = 'param_pca__n_components',
    y = 'mean_test_score',
    yerr = 'std_test_score',
    style='-o',
    c = 'k',
    capsize=4,
    ax = ax1,
    legend = False
)
ax1.set_ylabel('classes weighted average of f1 score')


ax1.set_xlabel('n components');

The optimal number of principal components seems to be 4. Using 5, 6, or 9 components also gives good results, but 4 is simpler to evaluate.
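
The preference for the simplest model among similar scores can be made systematic. Note that this is not what `GridSearchCV` does by default (it simply takes the highest mean score). The sketch below applies a "within one standard deviation of the best" rule to a made-up results table that only borrows the column names of `cv_results_`; the scores are invented.

```python
import pandas as pd

# Made-up scores, using the same column names as grid.cv_results_
toy_results = pd.DataFrame({
    'param_pca__n_components': [1, 2, 3, 4, 5, 6],
    'mean_test_score': [0.80, 0.88, 0.93, 0.95, 0.96, 0.955],
    'std_test_score': [0.03, 0.03, 0.02, 0.02, 0.02, 0.02],
})

# Row with the highest mean score
best = toy_results.loc[toy_results['mean_test_score'].idxmax()]
cutoff = best['mean_test_score'] - best['std_test_score']

# Simplest candidate whose mean score is within one std of the best
candidates = toy_results[toy_results['mean_test_score'] >= cutoff]
simplest = int(candidates['param_pca__n_components'].min())

print(int(best['param_pca__n_components']), simplest)
```

In this toy table the best mean score is at 5 components, but 4 components score within one standard deviation of it, so the rule picks 4.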

## Test the optimal estimator with the test dataset

In [None]:
plt.figure(figsize=(6,6))
plot_confusion_matrix(grid,X_test,y_test,ax=plt.gca());

# Test the model with external data

In [None]:
test_data = pd.read_csv('../input/strava-data/strava_full_data.csv')

In [None]:
test_data.columns