# DSS GM Lecture 2: Data Mining #
<i>Authors: Roshan Lodha & Kevin Chai</i>

In [None]:
#imports and styling
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
sns.set()

### What is data mining? ###
Wikipedia broadly defines data mining as <b>the process of discovering patterns in large data sets</b>. It draws on techniques from many fields and has obvious practical impact in areas like artificial intelligence. More subtly, applying data mining well can greatly improve work in other fields such as healthcare and education.

Our goal for today is to better define data mining and go over specific uses in real-world data sets.

### Section 0: Data Cleaning ###
Before we can search for patterns in our data, we need to ensure that it is clean in order to minimize future errors. Python (specifically pandas) has built-in commands that make data cleaning much simpler.

In [None]:
df = pd.read_csv('BL-Flickr-Images-Book.csv') #credit: realpython
df.head()

We can see that the dataframe is very "dirty." Many of the columns are seemingly unnecessary, so we can drop them.

In [None]:
df = df.drop(columns=['Edition Statement','Corporate Author','Corporate Contributors',
         'Former owner','Engraver','Contributors','Issuance type','Shelfmarks'])
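
One way to back up the choice of columns to drop is to measure how much of each column is missing. The sketch below assumes that "unnecessary" roughly means "mostly empty", which may not be the only reason a column was dropped above. The `toy` dataframe and its values are made up for illustration; the same `isna().mean()` call should work on `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the book dataframe; the values are invented.
toy = pd.DataFrame({
    'Identifier': [1, 2, 3, 4],
    'Edition Statement': [np.nan, np.nan, np.nan, 'A new edition'],
    'Place of Publication': ['London', 'London', 'London', 'Oxford'],
    'Engraver': [np.nan, np.nan, np.nan, np.nan],
})

# Fraction of missing values per column, largest first.
missing_fraction = toy.isna().mean().sort_values(ascending=False)
print(missing_fraction)

# Columns that are more than half empty are candidates for dropping.
mostly_empty = missing_fraction[missing_fraction > 0.5].index.tolist()
print(mostly_empty)
```

The 0.5 threshold is arbitrary; in practice we would also consider whether a sparse column still matters for the question being asked.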

We can also see that the "Identifier" column is unique, so we can use it as the index in place of the default ordered list of numbers, which does not tell us much.

In [None]:
#set_index returns a new dataframe, so we assign the result back to df
df = df.set_index('Identifier')
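
Before promoting a column to the index, it can help to confirm that it really is unique. The following is a minimal sketch on a made-up three-row table (the `toy` and `indexed` names and all values are hypothetical), showing the uniqueness check and the label-based lookup that the new index allows.

```python
import pandas as pd

# Hypothetical miniature of the book table; values are invented.
toy = pd.DataFrame({
    'Identifier': [101, 102, 103],
    'Title': ['Book A', 'Book B', 'Book C'],
})

# Only use a column as the index if every value identifies exactly one row.
print(toy['Identifier'].is_unique)

# set_index returns a new dataframe, so the result must be assigned.
indexed = toy.set_index('Identifier')

# Label-based lookup now uses the identifier instead of a row position.
print(indexed.loc[102, 'Title'])
```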

In the "Date of Publication" column we can see that the values are not in a consistent format. We can use regular expressions (regex) to extract the useful information. The mechanism and rules for regular expressions are out of the scope of this section but will be covered in a future lecture.

In [None]:
DoP = df['Date of Publication'].str.extract(r'^(\d{4})', expand=False)
df['Date of Publication'] = pd.to_numeric(DoP)
df.head()
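
To see what the extraction does without the full dataset, here is a small sketch on a few hypothetical date strings (the `raw` series is invented to mimic the kinds of inconsistent formats described above). It uses the same pattern and the same two pandas calls as the cell above.

```python
import pandas as pd

# Hypothetical date strings in inconsistent formats; not taken from the dataset.
raw = pd.Series(['1879 [1878]', '1868', '[1897?]', '1839, 38-54'])

# ^(\d{4}) captures four digits only when they start the string.
years = raw.str.extract(r'^(\d{4})', expand=False)

# Strings that do not start with four digits become NaN after conversion.
numeric_years = pd.to_numeric(years)
print(numeric_years.tolist())
```

Note that a value like '[1897?]' is lost with this pattern because it starts with a bracket; whether that is acceptable depends on how many rows are affected.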

### Section 1: Clustering ###
We'll start by defining a randomly generated dataset using sklearn, and show how data mining can reveal the groups hidden in it.

In [None]:
#dataset creation
#make_blobs is imported from sklearn.datasets; the samples_generator module was removed in newer sklearn versions
from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples=100, centers=3, cluster_std=1)
plt.scatter(X[:, 0], X[:, 1]);

From the plot above, we can visually see that there are 3 clusters. While the blobs are easy to separate in their current state, with more than a few dimensions we could no longer plot the data and distinguish the groups by eye. That's where clustering algorithms come in; using math not in the scope of this notebook, they can reveal underlying patterns in datasets.

A very popular and easy-to-use clustering algorithm is k-means clustering, which computes k centroids in the dataset.
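
For the curious, the core loop of k-means can be sketched in a few lines of NumPy. This is a simplified illustration, not sklearn's implementation: the function name `kmeans_sketch`, the fixed iteration count, and the random initialization are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=20, seed=0):
    """Simplified k-means: alternate between assigning points and moving centroids."""
    rng = np.random.default_rng(seed)
    # Start from k distinct points picked at random from the data.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points; keep it in place if it has none.
        centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return centroids, labels

# Two well-separated hypothetical groups of 2-D points.
rng = np.random.default_rng(1)
demo = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
demo_centroids, demo_labels = kmeans_sketch(demo, k=2)
print(demo_centroids)
```

In practice we would use sklearn's KMeans, which adds better initialization and a convergence check.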

Note: In practice, we do not know how many clusters truly define the dataset, so we try a range of values and select the number that best balances the number of clusters with loss.
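
The search over a range of values might look like the sketch below. It assumes that "loss" means sklearn's `inertia_` (the sum of squared distances from each point to its nearest centroid), and it regenerates its own blob data under the hypothetical name `X_demo` so that it runs on its own.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Regenerate a three-blob dataset; random_state is fixed only so the run is repeatable.
X_demo, _ = make_blobs(n_samples=100, centers=3, cluster_std=1, random_state=0)

inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    # inertia_ is the sum of squared distances from each point to its nearest centroid.
    inertias[k] = model.inertia_

for k, loss in inertias.items():
    print(k, round(loss, 1))
```

Loss keeps falling as k grows, so we look for the value of k after which the improvement levels off rather than for the minimum.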

In [None]:
from sklearn.cluster import KMeans

means = KMeans(n_clusters=3)
means.fit(X)
fitted = means.predict(X)

In [None]:
#visualizing the clusters; credit: jakevdp
plt.scatter(X[:, 0], X[:, 1], c=fitted, cmap='Accent')
centers = means.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], marker='*', c='red')

### Section 2: Regression ###
Another data mining technique you have likely heard of is regression. Put simply, regression is a predictive technique that seeks to turn structure like the clumps above into a predictive tool. While there are many different forms of regression (simple, linear, logistic, etc.), we will demo a Random Forest here. Because the target in our example (a flower's species) is a category rather than a number, the code below uses sklearn's Random Forest Classifier.

Random Forest works by building many decision trees, each of which defines boundaries that separate the clumps effectively while simultaneously minimizing loss, and then combining their predictions. If you want to read more about how it works and the loss it minimizes, read sklearn's documentation.
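
To make the idea of a forest more concrete, here is a rough sketch that builds one by hand from individual decision trees. It is a simplification under stated assumptions: the tree count of 25 and the variable names are arbitrary, and the trees are combined by majority vote, whereas sklearn's RandomForestClassifier averages the trees' predicted probabilities.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

features, labels = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

trees = []
for i in range(25):
    # Bootstrap sample: draw rows with replacement so each tree sees different data.
    rows = rng.integers(0, len(features), size=len(features))
    # max_features='sqrt' makes each split consider a random subset of features.
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=i)
    trees.append(tree.fit(features[rows], labels[rows]))

# Each tree votes for a class; the hand-built forest predicts the most common vote.
votes = np.stack([tree.predict(features) for tree in trees])  # shape (25, 150)
forest_pred = np.array([np.bincount(col).argmax() for col in votes.T])

# Measured on the same rows used for training, so this number is optimistic.
print("Agreement with true labels:", (forest_pred == labels).mean())
```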

In [None]:
#loading the dataset
from sklearn import datasets

iris = datasets.load_iris()
iris.data[0:5]

The numbers above make little sense on their own, so we can label them in a dataframe using the column meanings from sklearn's documentation.

In [None]:
data=pd.DataFrame({
    'sepal length':iris.data[:,0],
    'sepal width':iris.data[:,1],
    'petal length':iris.data[:,2],
    'petal width':iris.data[:,3],
    'species':iris.target
})
data.head()

A key difference between clustering and regression is that the latter needs labeled training data. As the name implies, training data trains the model to predict an outcome given input features, and we hold back a separate test set to check those predictions. While it is easy to manually split a dataframe, we will use sklearn's train_test_split to make our life easier.

In [None]:
from sklearn.model_selection import train_test_split

X=data[['sepal length', 'sepal width', 'petal length', 'petal width']]
y=data['species']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) 
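
For comparison, a manual split could be sketched as follows. The ten-row `toy` dataframe is hypothetical and stands in for the iris table; the idea is just to shuffle the rows and cut at the 80% mark, matching test_size=0.2.

```python
import numpy as np
import pandas as pd

# Hypothetical ten-row dataframe standing in for the iris table.
toy = pd.DataFrame({'feature': np.arange(10), 'label': [0, 1] * 5})

# Shuffle the rows, then cut at the 80% mark.
shuffled = toy.sample(frac=1, random_state=0)
cut = int(len(shuffled) * 0.8)
train, test = shuffled.iloc[:cut], shuffled.iloc[cut:]

print(len(train), len(test))
```

Shuffling first matters for iris in particular, because its rows are ordered by species.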

We are now ready to train our Random Forest Classifier.

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500)
rf.fit(X_train, y_train)

pred = rf.predict(X_test)

Below we assess how well our model predicted the species of each flower.

In [None]:
from sklearn import metrics

print("Accuracy: ", metrics.accuracy_score(y_test, pred))
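
Accuracy is a single number, so it can hide which species get confused with each other. One possible follow-up is a confusion matrix, sketched below. The block retrains its own model under hypothetical names (`iris_demo`, `model`, `matrix`) and with fixed random_state values so that it runs on its own; those choices are assumptions for the example, not part of the analysis above.

```python
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris_demo = datasets.load_iris()
# random_state values are fixed only so that the run is repeatable.
Xtr, Xte, ytr, yte = train_test_split(
    iris_demo.data, iris_demo.target, test_size=0.2, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xtr, ytr)

# Rows are true species, columns are predicted species;
# the diagonal counts correct predictions.
matrix = metrics.confusion_matrix(yte, model.predict(Xte), labels=[0, 1, 2])
print(iris_demo.target_names)
print(matrix)
```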