## Theoretical

### 1. For the following confusion matrix, answer the following:
a. The size of the dataset
- 900
    
b. Is it an imbalanced dataset?
- Yes, it is imbalanced: the "No" class has far more instances than the "Yes" class (800 versus 100 actual instances, going by the counts used in part c).
    
c. Compute the accuracy, precision, recall and F-measure
- accuracy = (100+700)/(100+0+100+700) = 0.89
- precision = 100/(100+100) = 0.5
- recall = 100/(100+0) = 1
- F-measure = (2x100)/((2x100)+0+100) = 0.67
        
![Confusion Matrix](https://i.imgur.com/WpmPWFK.png)
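The arithmetic in part c can be checked with a few lines of Python. This is a minimal sketch; the variable names (`tp`, `fn`, `fp`, `tn`) are my own labels for the four counts used in the answers above.

```python
# Counts taken from the formulas in part c (names are illustrative).
tp, fn, fp, tn = 100, 0, 100, 700

total = tp + fn + fp + tn                  # size of the dataset
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_measure = 2 * tp / (2 * tp + fp + fn)    # harmonic mean of precision and recall

print(f"size={total} accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} F={f_measure:.2f}")
# size=900 accuracy=0.89 precision=0.50 recall=1.00 F=0.67
```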

### 2. What is overfitting? Why do some models overfit?

Overfitting occurs when a model fits the training data too closely: the training error is small, but the test error is large because the model does not generalize to unseen data.

Common reasons for overfitting are a limited amount of training data and high model complexity.
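The gap between training and test error can be illustrated with a very flexible model. The sketch below uses a 1-nearest-neighbour classifier on synthetic, noisy 1-D data; the data generator and function names are assumptions made for illustration only. Because 1-NN memorizes the training set, its training error is zero, while the flipped labels it memorized typically hurt it on fresh data.

```python
import random

random.seed(0)

def make_data(n, noise=0.2):
    """1-D points labelled by x > 0.5, with a fraction of labels flipped."""
    data = []
    for _ in range(n):
        x = random.random()
        label = int(x > 0.5)
        if random.random() < noise:
            label = 1 - label  # label noise the model should not learn
        data.append((x, label))
    return data

def predict_1nn(train, x):
    """Return the label of the training point closest to x."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def error_rate(train, data):
    """Fraction of points in `data` misclassified by 1-NN fitted on `train`."""
    wrong = sum(predict_1nn(train, x) != y for x, y in data)
    return wrong / len(data)

train = make_data(30)    # small training set
test = make_data(300)    # fresh data from the same distribution

train_error = error_rate(train, train)
test_error = error_rate(train, test)
print(f"training error = {train_error:.2f}, test error = {test_error:.2f}")
```

The training error is zero by construction, since every training point is its own nearest neighbour.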

### 3. Describe two methods for model evaluation

- Holdout
    - Reserve `k%` for training and `(100-k)%` for testing 
    - Random subsampling: repeated holdout
- Cross validation
    - Partition data into `k` disjoint subsets
    - k-fold: train on `k-1` partitions, test on the remaining one
    - Leave-one-out: `k=n`
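Both schemes come down to how the row indices are split. The following is a minimal sketch in plain Python; `holdout_split` and `k_fold_splits` are hypothetical helper names, not library functions.

```python
import random

def holdout_split(n, train_pct, seed=0):
    """Shuffle indices 0..n-1 and reserve train_pct% of them for training."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = n * train_pct // 100
    return indices[:cut], indices[cut:]

def k_fold_splits(n, k, seed=0):
    """Yield (train, test) index lists; each of the k folds is the test set once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint subsets
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]

train_idx, test_idx = holdout_split(10, 70)
print(len(train_idx), len(test_idx))  # 7 3

for train, test in k_fold_splits(7, 3):
    print(len(train), len(test))

# Leave-one-out is the special case k = n: every test fold has one element.
loo = list(k_fold_splits(5, 5))
```

Random subsampling corresponds to calling `holdout_split` several times with different seeds and averaging the resulting scores.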

### 4. Describe three clustering algorithms

1. K-means and its variants
    - Partitional clustering approach 
    - Number of clusters, K, must be specified
    - Each cluster is associated with a centroid (center point) 
    - Each point is assigned to the cluster with the closest centroid
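    - The basic algorithm is simple: assign each point to its nearest centroid, recompute the centroids, and repeat until the assignments stop changing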
2. Hierarchical clustering
    - Produces a set of nested clusters organized as a hierarchical tree
    - Can be visualized as a dendrogram (a tree-like diagram that records the sequence of merges or splits)
    - Two main types: agglomerative (bottom-up merging) and divisive (top-down splitting)
    - Traditional hierarchical algorithms use a similarity or distance matrix and merge or split one cluster at a time
3. Density-based clustering
    - Density = number of points within a specified radius (Eps)
    - A point is a core point if it has at least a specified number of points (MinPts) within Eps 
        - These are points that are at the interior of a cluster
        - Counts the point itself
    - A border point is not a core point, but is in the neighborhood of a core point
    - A noise point is any point that is not a core point or a border point
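The core, border, and noise definitions for density-based clustering can be sketched directly. This is an illustrative brute-force version, not DBSCAN itself (it labels point types but does not build clusters); `classify_points` and the toy points are assumptions.

```python
from math import dist

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' using the Eps/MinPts rules."""
    # Neighbourhoods include the point itself, as in the definition above.
    neighbours = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    core = {i for i, nb in enumerate(neighbours) if len(nb) >= min_pts}
    labels = []
    for i, nb in enumerate(neighbours):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):
            labels.append("border")  # not core, but within Eps of a core point
        else:
            labels.append("noise")
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (8, 8)]
labels = classify_points(points, eps=1.0, min_pts=3)
print(labels)
# ['core', 'core', 'core', 'core', 'border', 'noise']
```

Here `(2, 1)` has only two points in its neighbourhood, so it is not a core point, but it lies within Eps of the core point `(1, 1)` and is therefore a border point.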

### 5. What are the cases when DBSCAN does not work well?

DBSCAN does not work well when there are:
- Varying densities
- High-dimensional data

## Practical

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import cluster
from sklearn.feature_extraction.text import TfidfVectorizer

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

import re
import string

### 1. Use the BOW model on the text of the BBC dataset (apply data preprocessing) and do K-means clustering (don't use the category feature). See if the clusters correspond to the categories. (Choose K to be the number of categories)

**Note:** Ask the Dr. what he means by "see if the clusters correspond to the categories"

In [None]:
bbc_df = pd.read_csv("../input/bbc-fulltext-and-category/bbc-text.csv")
bbc_df.head()

Removing punctuation and line-break characters

In [None]:
punct = "\n\r"+string.punctuation

def noise_removal(value):
    return value.translate(str.maketrans('', '', punct))

In [None]:
text_col = bbc_df['text'].apply(noise_removal)
text_col.head()

Removing stop words, applying stemming and lemmatization, then doing TF-IDF Term Weighting.

In [None]:
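def tokenize(str_input):
    # Replace every non-letter with a space, lowercase, and split on whitespace
    words = re.sub(r"(?u)[^A-Za-z]", " ", str_input).lower().split()
    # Stem tokens longer than two characters, then lemmatize the resulting stems
    words = [stemmer.stem(word) for word in words if len(word)>2]
    words = [wordnet_lemmatizer.lemmatize(word) for word in words if len(word)>2]
    return words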

In [None]:
stemmer = PorterStemmer()
wordnet_lemmatizer = WordNetLemmatizer()

vectorizer = TfidfVectorizer(tokenizer=tokenize, stop_words='english')

vectors = vectorizer.fit_transform(text_col)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = vectorizer.get_feature_names_out()

In [None]:
print("number of words = ", len(feature_names))

In [None]:
text_tfidf = pd.DataFrame(vectors.toarray(),columns=feature_names)
text_tfidf.head()

In [None]:
text_tfidf.shape

Checking the number of categories and using it as the value of K for K-means clustering

In [None]:
bbc_df['category'].value_counts()

In [None]:
k_means = cluster.KMeans(n_clusters=5, max_iter=50, random_state=1)
k_means.fit(text_tfidf) 
labels = k_means.labels_

In [None]:
cluster_result_df = pd.DataFrame(labels, index=bbc_df['category'], columns=['Cluster ID'])
cluster_result_df.head()

Checking the category-to-cluster correspondence

In [None]:
cluster_result_df.loc['tech'].value_counts()

Articles in the tech category fall mainly in cluster 4

In [None]:
cluster_result_df.loc['business'].value_counts()

Articles in the business category fall mainly in cluster 0

In [None]:
cluster_result_df.loc['sport'].value_counts()

Articles in the sport category fall mainly in cluster 1

In [None]:
cluster_result_df.loc['entertainment'].value_counts()

Articles in the entertainment category fall mainly in clusters 0 and 2

In [None]:
cluster_result_df.loc['politics'].value_counts()

Articles in the politics category fall mainly in clusters 3 and 0
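The per-category checks above can be condensed into one summary: the majority category of each cluster, plus the overall purity (the fraction of documents that belong to their cluster's majority category). This is a sketch in plain Python; `cluster_summary` and the toy lists are hypothetical and stand in for `bbc_df['category']` and `labels`.

```python
from collections import Counter, defaultdict

def cluster_summary(categories, cluster_ids):
    """Return ({cluster: majority category}, purity) for paired labels."""
    by_cluster = defaultdict(Counter)
    for category, cluster_id in zip(categories, cluster_ids):
        by_cluster[cluster_id][category] += 1
    # Most frequent category in each cluster
    majority = {cid: counts.most_common(1)[0][0]
                for cid, counts in by_cluster.items()}
    # Purity: share of items that match their cluster's majority category
    matched = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return majority, matched / len(cluster_ids)

# Toy data for illustration only, not results from the BBC dataset
toy_categories = ["tech", "tech", "sport", "sport", "sport", "business", "business"]
toy_clusters = [4, 4, 1, 1, 0, 0, 0]

majority, purity = cluster_summary(toy_categories, toy_clusters)
print(majority)
print(round(purity, 2))
```

In the notebook, the same function could be called as `cluster_summary(bbc_df['category'], labels)`. A purity close to 1 would indicate that the clusters line up well with the categories.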

### 2. Apply K-means clustering to the customer segmentation dataset. See what the clusters correspond to.

In [None]:
mc_df = pd.read_csv("../input/customer-segmentation-tutorial-in-python/Mall_Customers.csv")
mc_df.head()

Checking the summary stats for the numerical columns

In [None]:
data = mc_df.copy()[['Age', 'Annual Income (k$)','Spending Score (1-100)']]
data.describe(include='all')

Binning the income and spending values into categories so that the cluster IDs can be compared against the customers' income and spending patterns

In [None]:
def income_categories(value):
    if value >= 100:
        return "Very High"
    elif value >= 75:
        return "High"
    elif value >= 50:
        return "Medium"
    else:
        return "Low"

def spending_categories(value):
    if value >= 75:
        return "Very High"
    elif value >= 50:
        return "High"
    elif value >= 25:
        return "Medium"
    else:
        return "Low"

In [None]:
income_cats = mc_df['Annual Income (k$)'].apply(income_categories)
spending_cats = mc_df['Spending Score (1-100)'].apply(spending_categories)

In [None]:
k_means = cluster.KMeans(n_clusters=4, max_iter=50, random_state=1)
k_means.fit(data) 
labels = k_means.labels_

In [None]:
income_res_df = pd.DataFrame(labels, index=income_cats, columns=['Cluster ID'])
income_res_df.head()

In [None]:
income_res_df.loc['Very High'].value_counts()

In [None]:
income_res_df.loc['High'].value_counts()

In [None]:
income_res_df.loc['Medium'].value_counts()

In [None]:
income_res_df.loc['Low'].value_counts()

- People with very high income and high income are found only in clusters 2 and 3
- People with medium income are found mainly in cluster 1, with a few in clusters 2 and 3
- People with low income are found in clusters 1 and 0

In [None]:
spending_res_df = pd.DataFrame(labels, index=spending_cats, columns=['Cluster ID'])
spending_res_df.head()

In [None]:
spending_res_df.loc['Very High'].value_counts()

In [None]:
spending_res_df.loc['High'].value_counts()

In [None]:
spending_res_df.loc['Medium'].value_counts()

In [None]:
spending_res_df.loc['Low'].value_counts()

- People with very high spending are found in clusters 2 and 0
- People with high spending are found in clusters 2, 1, and 0, with the majority in cluster 1
- People with medium spending are found in clusters 3, 1, and 0, with the majority in cluster 1
- People with low spending are found in clusters 3 and 1, with the majority in cluster 3
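Another way to see what each cluster corresponds to is to average the features within each cluster. The sketch below does this in plain Python; `cluster_means` and the toy rows (age, annual income, spending score) are illustrative assumptions, not values from the dataset.

```python
from collections import defaultdict

def cluster_means(rows, cluster_ids):
    """Average each numeric column within each cluster."""
    grouped = defaultdict(list)
    for row, cluster_id in zip(rows, cluster_ids):
        grouped[cluster_id].append(row)
    return {
        cluster_id: tuple(sum(col) / len(col) for col in zip(*members))
        for cluster_id, members in grouped.items()
    }

# Toy rows: (age, annual income in k$, spending score)
toy_rows = [(20, 15, 80), (22, 17, 76), (45, 90, 10), (50, 100, 20)]
toy_ids = [0, 0, 3, 3]

means = cluster_means(toy_rows, toy_ids)
print(means)
# {0: (21.0, 16.0, 78.0), 3: (47.5, 95.0, 15.0)}
```

With pandas, an equivalent summary for this notebook would be something like `data.assign(cluster=labels).groupby('cluster').mean()`, assuming `data` and `labels` are as defined in the cells above.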