# Principal Component Analysis

To which family of dimensionality reduction algorithms does Principal Component Analysis belong?

- Principal Component Analysis is an example of a linear dimensionality reduction algorithm.
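To make "linear" concrete, here is a minimal sketch. It assumes scikit-learn's `PCA` with default settings and uses random stand-in data (the array `X` below is hypothetical, not a course dataset). It checks that the PCA projection is nothing more than centering the data and multiplying by a fixed matrix.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 200 samples with 5 features
X = rng.normal(size=(200, 5))

pca = PCA(n_components=2).fit(X)
projected = pca.transform(X)

# The same projection done by hand: subtract the mean, multiply by the components
manual = (X - pca.mean_) @ pca.components_.T

is_linear = np.allclose(projected, manual)
print("transform equals centering + matrix product:", is_linear)
```

Because the mapping is a single matrix product, PCA cannot "unfold" curved structure in the data; non-linear methods exist for that purpose.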

# Visitors from outer space

So you have concluded that you must resort to dimensionality reduction because of the very limited computational resources available for crunching your high-dimensional dataset.

And for the same reason, you feel that the PCA algorithm is the best choice due to its speed and simplicity.

Good. But did you check your data for outliers? Let's see how they could impact your results.

A 3-dimensional dataset of 1000 samples (X_raw), slightly "contaminated" with 5 outliers (X_new), has been pre-loaded, as shown in Figure 1.

In Figure 2 you can see that the impact of these outliers (in red) is negligible and causes no problem in extracting the actual principal components.

But what happens if they are further away?

<center><img src="images/03.031.svg"  style="width: 200px, height: 200px;"/></center>


In [1]:
# # Add outliers to the blob
# X_new, outliers = add_outliers(X_raw,
#                                outlier_distance=200,
#                                n_outliers=5)

# plot_3d_data(X_new, outliers)

<center><img src="images/03.032.svg"  style="width: 200px, height: 200px;"/></center>


In [None]:
# # Add outliers to the blob
# X_new, outliers = add_outliers(X_raw, outlier_distance=200, n_outliers=5)

# plot_3d_data(X_new, outliers)  

# # Extract principal components
# X_2D, outliers_2D = extract_components(X_new, outliers, n_components=2)

# # Plot the PCA results
# plot_2d_data(X_2D, outliers_2D)

<center><img src="images/03.033.svg"  style="width: 200px, height: 200px;"/></center>

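The helpers used above (`add_outliers`, `extract_components`) are specific to the exercise environment. As a rough, self-contained sketch of the same effect, the code below builds a hypothetical stand-in for X_raw, adds 5 outliers along the z-axis (the function `with_outliers` is an assumption, not the course helper), and compares the first principal component with the one found on clean data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical stand-in for X_raw: variance is largest along x, then y, then z
X_raw = rng.normal(size=(1000, 3)) * np.array([5.0, 2.0, 0.5])

def first_component(X):
    """Return the unit vector of the first principal component."""
    return PCA(n_components=1).fit(X).components_[0]

def with_outliers(X, distance, n_outliers=5):
    """Append n_outliers points placed roughly `distance` away along the z-axis."""
    outliers = np.tile([0.0, 0.0, distance], (n_outliers, 1))
    outliers += rng.normal(scale=0.1, size=outliers.shape)
    return np.vstack([X, outliers])

pc_clean = first_component(X_raw)
pc_near = first_component(with_outliers(X_raw, distance=5))
pc_far = first_component(with_outliers(X_raw, distance=200))

# Absolute cosine similarity with the clean direction (the sign of a component is arbitrary)
sim_near = abs(pc_clean @ pc_near)
sim_far = abs(pc_clean @ pc_far)
print(f"near outliers: {sim_near:.3f}, far outliers: {sim_far:.3f}")
```

A similarity close to 1 means the outliers left the first component essentially unchanged; a similarity close to 0 means a handful of distant points has hijacked the direction of maximum variance.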

# Lucky number K

Beginners in Machine Learning are often optimistic that it can produce amazing insights with little to no human involvement or decision making.

The truth is that the performance of your algorithms is heavily influenced by parameters that you as a human define before the model has seen any data.

In the case of clustering, most algorithms still require you to be explicit about the number of clusters you are looking for. But not all!

Which of the following clustering algorithms determines the number of clusters on its own?

- DBSCAN determines the number of clusters on its own.

# Elbow reading

Determining the right number of clusters is one of the most crucial steps in developing a clustering model.

In this exercise, you will apply K-means clustering and the "elbow method" to determine the correct number of clusters present in the dataset at hand.

The data is loaded in the variable X and you have been provided with two functions for your convenience, plot_clusters() and plot_elbow_curve(), to facilitate the discovery process.

Your task is to specify the range of numbers of clusters over which to scan in order to produce the "elbow curve".

<center><img src="images/03.061.svg"  style="width: 200px, height: 200px;"/></center>


In [2]:
# k_range = list(range(2, 6))

# summed_distances = []

# for k in k_range:    
#     kmeans.set_params(n_clusters=k).fit(X)
#     summed_distances.append(kmeans.inertia_)

# plot_elbow_curve(k_range, summed_distances)

<center><img src="images/03.062.svg"  style="width: 200px, height: 200px;"/></center>


- 4
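For reference, here is a self-contained sketch of the elbow scan. It assumes synthetic data from `make_blobs` with four well-separated centers (a stand-in for the exercise's X) and creates a fresh `KMeans` model per value of k instead of reusing a pre-loaded `kmeans` object.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical stand-in for X: four well-separated blobs
centers = [[-6, -6], [-6, 6], [6, -6], [6, 6]]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=1.0, random_state=0)

k_range = list(range(1, 9))
summed_distances = []

for k in k_range:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances of samples to their closest centroid
    summed_distances.append(kmeans.inertia_)

# Relative drop in inertia gained by adding one more cluster
for i in range(1, len(k_range)):
    drop = 1 - summed_distances[i] / summed_distances[i - 1]
    print(f"k={k_range[i]}: inertia={summed_distances[i]:.0f}, drop={drop:.0%}")
```

The "elbow" is the last value of k that still buys a large drop in inertia; beyond it, extra clusters only split groups that already belong together.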

# DBSCAN

DBSCAN is another very popular clustering algorithm, belonging to the family of density-based algorithms.

For beginners it can seem very attractive because it doesn't require the number of clusters to be defined in advance.

But there's no free lunch and relying on DBSCAN to find the right number of clusters completely on its own can be a big trap.

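Let's illustrate this by playing with DBSCAN's hyperparameter eps, which defines the maximum distance between two points for them to be considered neighbors. Note that it is not an upper bound on the distances between points within the same cluster.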

In [3]:
# # Set eps to 0.1
# eps = 0.1

# dbscan.set_params(eps=eps)

# clusters = dbscan.fit_predict(X)

# plot_clusters(X, clusters)

<center><img src="images/03.071.svg"  style="width: 200px, height: 200px;"/></center>


In [4]:
# # Set eps to 0.5
# eps = 0.5

# dbscan.set_params(eps=eps)

# clusters = dbscan.fit_predict(X)

# plot_clusters(X, clusters)

<center><img src="images/03.072.svg"  style="width: 200px, height: 200px;"/></center>


In [5]:
# # Set eps to 2
# eps = 2

# dbscan.set_params(eps=eps)

# clusters = dbscan.fit_predict(X)

# plot_clusters(X, clusters)

<center><img src="images/03.073.svg"  style="width: 200px, height: 200px;"/></center>

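The same experiment can be sketched without the pre-loaded `dbscan` object and `plot_clusters()` helper. The data below is a hypothetical stand-in (three blobs from `make_blobs`), and the eps values are chosen to suit that data rather than copied from the cells above. Instead of plotting, the sketch counts clusters and noise points.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Hypothetical stand-in for X: three tight, well-separated blobs
centers = [[-5, 0], [0, 5], [5, 0]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=0)

def count_clusters(eps):
    """Fit DBSCAN and return (number of clusters, number of noise points)."""
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    n_clusters = len(set(labels) - {-1})  # the label -1 marks noise
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise

results = {eps: count_clusters(eps) for eps in (0.01, 0.5, 10)}
for eps, (n_clusters, n_noise) in results.items():
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

A very small eps leaves every point without enough neighbors, so everything becomes noise; a very large eps connects all points into a single cluster. DBSCAN finds the number of clusters "on its own" only after you have chosen eps well.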

# How unsupervised really?

When evaluating the performance of Anomaly Detection models, you most often use metrics from the domain of:
- Supervised learning

# The go-to algorithm

Despite being a bit more computationally intensive than other methods, one algorithm is commonly used for anomaly detection. Which algorithm is it?

- Isolation Forest is commonly used for anomaly detection.

# The odd one out

You saw previously that the IsolationForest() algorithm is a great first choice when in need of anomaly or outlier detection.

In this exercise you want to examine how the ratio of inliers to outliers (a.k.a. the signal-to-noise ratio) affects its ability to detect anomalies.

The IsolationForest() algorithm has been loaded for you in the variable called isolation_forest, and a helper function make_fake_data() was loaded as well. Your task is to gradually increase the number of outliers and observe the difference in results in each iteration.

In [6]:
# # Generate data comprising the "clean" and "noisy" components
# noisy_data, true_labels = make_fake_data(n_blobs=2, n_inliers=1000, n_outliers=50)

# # Detect anomalies
# predicted_anomalies = isolation_forest.fit_predict(noisy_data)
    
# # Plot results    
# plot_detected_anomalies(noisy_data, true_labels, predicted_anomalies)

<center><img src="images/03.111.svg"  style="width: 200px, height: 200px;"/></center>


In [7]:
# # Generate data comprising the "clean" and "noisy" components
# noisy_data, true_labels = make_fake_data(n_blobs=2, n_inliers=1000, n_outliers=200)

# # Detect anomalies
# predicted_anomalies = isolation_forest.fit_predict(noisy_data)
    
# # Plot results    
# plot_detected_anomalies(noisy_data, true_labels, predicted_anomalies)

<center><img src="images/03.112.svg"  style="width: 200px, height: 200px;"/></center>


In [8]:
# # Generate data comprising the "clean" and "noisy" components
# noisy_data, true_labels = make_fake_data(n_blobs=2, n_inliers=1000, n_outliers=500)

# # Detect anomalies
# predicted_anomalies = isolation_forest.fit_predict(noisy_data)
    
# # Plot results    
# plot_detected_anomalies(noisy_data, true_labels, predicted_anomalies)

<center><img src="images/03.113.svg"  style="width: 200px, height: 200px;"/></center>

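Since `make_fake_data()` and `plot_detected_anomalies()` exist only in the exercise environment, here is a rough stand-alone sketch of the same loop. The generator `make_data` is hypothetical (Gaussian inliers plus uniformly scattered outliers), and instead of plotting it scores the detections with precision and recall, which are supervised-learning metrics, using the known labels.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score

def make_data(n_inliers, n_outliers, seed=0):
    """Hypothetical generator: a Gaussian blob of inliers plus uniform outliers."""
    rng = np.random.default_rng(seed)
    inliers = rng.normal(loc=0.0, scale=1.0, size=(n_inliers, 2))
    outliers = rng.uniform(low=-10, high=10, size=(n_outliers, 2))
    X = np.vstack([inliers, outliers])
    # Same convention as IsolationForest: 1 = inlier, -1 = outlier
    y = np.r_[np.ones(n_inliers), -np.ones(n_outliers)].astype(int)
    return X, y

scores = {}
for n_outliers in (50, 200, 500):
    X, y_true = make_data(n_inliers=1000, n_outliers=n_outliers)
    y_pred = IsolationForest(random_state=0).fit_predict(X)
    scores[n_outliers] = (
        precision_score(y_true, y_pred, pos_label=-1),
        recall_score(y_true, y_pred, pos_label=-1),
    )

for n_outliers, (precision, recall) in scores.items():
    print(f"{n_outliers} outliers: precision={precision:.2f}, recall={recall:.2f}")
```

Comparing the three rows shows how the detector's quality changes as the share of outliers grows. The exact numbers depend on the synthetic data, so treat them as illustrative only.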

# Elon's tweets

You will attempt the impossible: detecting patterns in Elon Musk's tweets!

You will apply two unsupervised learning algorithms:

- Dimensionality reduction, to translate your text data into a 2D space.
- Clustering, to find groups of similar tweets.

The go-to model for dimensionality reduction is Principal Component Analysis (PCA), while the KMeans algorithm plays the same role in the domain of clustering.

- Tweets in their raw form were loaded into the variable named tweets_raw.
- They have also been translated into a machine-digestible, vectorized form, contained in the variable tweets_matrix.
- To write less code, we want you to use the functions for combined fitting and transformation/prediction: .fit_transform() and .fit_predict().
- Be aware that this is real data from Twitter and as such there is always a risk that it may contain profanity or other offensive content (in this exercise, and any following exercises that also use real Twitter data).

In [9]:
# # Set the number of dimensions to 2
# dimensionality_reducer = PCA(n_components=2)

# # Apply dimensionality reduction
# tweets_reduced = dimensionality_reducer.fit_transform(tweets_matrix)

# # Configure the clustering model
# clustering_model = KMeans(n_clusters=2)

# # Find clusters
# tweet_clusters = clustering_model.fit_predict(tweets_reduced)

# # Show the clustering results
# print_cluster_tweets(tweet_clusters, tweets_raw)

<center><img src="images/03.12.svg"  style="width: 200px, height: 200px;"/></center>

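The exercise does not show how tweets_matrix was produced. As one plausible sketch of the whole pipeline, the code below assumes TF-IDF vectorization and uses a handful of made-up sentences in place of tweets_raw (both are assumptions, not the course's actual preprocessing or data).

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up stand-in for tweets_raw
texts = [
    "rocket launch tonight from the pad",
    "the rocket landed after launch",
    "booster landed on the pad after the rocket launch",
    "new electric car battery range",
    "car battery charging is faster",
    "electric car range and battery charging",
]

# Vectorize the text; the sparse TF-IDF matrix is converted to a dense array for PCA
matrix = TfidfVectorizer(stop_words="english").fit_transform(texts).toarray()

# Reduce to 2 dimensions, then cluster in the reduced space
reduced = PCA(n_components=2).fit_transform(matrix)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)

# Print the texts grouped by cluster label
for label, text in sorted(zip(clusters, texts)):
    print(label, text)
```

Calling `.toarray()` is fine for a toy corpus, but a real tweet matrix can be large; in that case a reducer that accepts sparse input directly may be the more practical choice.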

# Fruits of knowledge

When you want to know why your model has made a certain decision for a specific single record, you are engaging in so-called "local model interpretation".

Currently, the most popular algorithm for this purpose has a very "fruity" acronym. Which one is it?

- LIME

# Predicting customer churn

Congratulations! You have just been hired as a Junior Data Scientist for a big telecommunications company.

On your first day, you are asked to help with a big problem the company is struggling with: predicting customer churn.

Luckily for you, your colleague has already prepared the dataset. You just need to use it to train a predictive model and determine its performance.

The exercise datasets have been loaded for your convenience:

- client_data holds the inputs (gender, tenure, monthly costs, number of dependents, etc)
- client_churned holds information on whether this client churned or not ('Yes', 'No')


As for the models, you have the RandomForestClassifier and LinearRegression at your disposal -- choose wisely!

In [10]:
# # Define your model
# model = RandomForestClassifier()

# # Divide the data into the training and testing set and train the model
# X_train, X_test, y_train, y_test = train_test_split(client_data, client_churned)
# model.fit(X_train, y_train)

# # Generate predictions on the test set
# predictions_test = model.predict(X_test)

# # Evaluate the model predictions using metrics appropriate for this problem class
# print_metrics(target_test=y_test,
#               predictions=predictions_test, 
#               metrics=['accuracy', 'precision', 'recall'])
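Churn is a 'Yes'/'No' target, so this is a classification problem; that is why the classifier is the wise choice here, while LinearRegression predicts continuous values. The sketch below reproduces the workflow end to end on synthetic data. The two features and the churn rule are invented for illustration, and scikit-learn's metric functions stand in for the exercise's `print_metrics()` helper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical stand-in for client_data: tenure (months) and monthly cost
tenure = rng.integers(1, 72, size=1000)
monthly_cost = rng.uniform(20, 120, size=1000)
client_data = np.column_stack([tenure, monthly_cost])

# Invented rule: short-tenure, high-cost clients churn; 10% of labels are flipped as noise
churn = (tenure < 24) & (monthly_cost > 70)
flip = rng.random(1000) < 0.1
client_churned = np.where(churn ^ flip, "Yes", "No")

# Stratify so both splits keep the same share of churners
X_train, X_test, y_train, y_test = train_test_split(
    client_data, client_churned, stratify=client_churned, random_state=0
)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
predictions_test = model.predict(X_test)

# 'Yes' (churned) is the class of interest, so it is the positive label
metrics = {
    "accuracy": accuracy_score(y_test, predictions_test),
    "precision": precision_score(y_test, predictions_test, pos_label="Yes"),
    "recall": recall_score(y_test, predictions_test, pos_label="Yes"),
}
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With churners in the minority, accuracy alone can look good even for a model that rarely predicts 'Yes', which is why precision and recall are reported alongside it.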