# Advanced Topics in Data Science (CS5661). Cal State Univ. LA, CS Dept.
### Instructor: Dr. Mohammad Pourhomayoun
---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------


# Data Science in Python

#### This is a review of data science libraries/packages in Python. Feel free to refer to the suggested resources and documentation for more details.

---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------


# Scikit-Learn Library (sklearn):
Scikit-learn is a widely used machine learning library for Python. It includes efficient implementations of various classification, regression, and clustering algorithms. It also provides many classes and functions for data preprocessing, along with a number of built-in datasets to work with.


## The Main Steps to build (train) and use (test/predict) a predictive model in sklearn:

## Step1: Importing the sklearn class (machine learning algorithm) that you would like to use for modeling:

In [None]:
# The following line imports the DecisionTreeClassifier "class".
# DecisionTreeClassifier is the name of the "sklearn class" that performs "Decision Tree Classification".

from sklearn.tree import DecisionTreeClassifier

In [None]:
# Importing the required packages and libraries
# we will need numpy and pandas later
import numpy as np
import pandas as pd


## Step2: Set up the Feature Matrix and Label Vector:

## Let's start with iris data as a popular and simple dataset:


In [None]:
# Reading a CSV file directly from the web and storing it in a pandas DataFrame:
# "read_csv" is a pandas function that reads CSV files from a URL or from a local path:

iris_df = pd.read_csv('https://raw.githubusercontent.com/mpourhoma/CS5661/master/iris.csv')


In [None]:
# checking the dataset by printing every 10th row:
iris_df[0::10]

In [None]:
# Defining a function to convert "categorical" labels to "numerical" labels (Optional)
# Notice that Scikit-Learn classifiers can also handle categorical (string) labels. So, this step is optional.

def categorical_to_numeric(x):
    if x == 'setosa':
        return 0
    elif x == 'versicolor':
        return 1
    elif x == 'virginica':
        return 2
    
# Applying the function to the species column and adding the corresponding label column:
iris_df['label'] = iris_df['species'].apply(categorical_to_numeric)

# checking the dataset by printing every 10th row:
iris_df[0::10]
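As an alternative to writing a conversion function, the same categorical-to-numerical conversion can be sketched with the pandas `map` method and a dictionary. The small `species` Series below is a toy stand-in for the real `species` column (an assumption for illustration, not the actual file contents), and `species_to_label` is a hypothetical name.

```python
import pandas as pd

# Toy stand-in for the 'species' column (illustrative values only)
species = pd.Series(['setosa', 'versicolor', 'virginica', 'setosa'])

# Dictionary that assigns a numerical label to each category.
# Any category missing from the dictionary would become NaN.
species_to_label = {'setosa': 0, 'versicolor': 1, 'virginica': 2}

labels = species.map(species_to_label)

print(labels.tolist())  # [0, 1, 2, 0]
```

On the real DataFrame this would read `iris_df['label'] = iris_df['species'].map(species_to_label)`.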

In [None]:
# Creating the Feature Matrix for iris dataset:

# create a python list of the feature names that we would like to pick from the dataset:
feature_cols = ['sepal_length','sepal_width','petal_length','petal_width']

# use the above list to select the features from the original DataFrame
X = iris_df[feature_cols]  

# print the first 5 rows
X.head()

In [None]:
# select the labels (the 'species' column) from the DataFrame as a pandas Series
y = iris_df['species']  # or: iris_df['label']

# checking the label vector by printing every 10th value
y[::10]

## Step3: Defining (instantiating) an "object" from the sklearn class:

In [None]:
# In the following line, "my_decisiontree" is instantiated as an "object" of DecisionTreeClassifier "class". 

my_decisiontree = DecisionTreeClassifier()


## Step4: Training Stage: Training a predictive model using the training dataset:
#### The Training Stage is called Fitting in sklearn
#### Method "fit" is used for many sklearn classes

In [None]:
# We can use the method "fit" of the "object my_decisiontree" along with training dataset and labels to train the model.

my_decisiontree.fit(X, y)
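Once a tree has been fitted, it can be helpful to look at the rules it learned. The following is a minimal sketch using `export_text` from `sklearn.tree`. To keep it self-contained, it uses the copy of the iris data bundled with sklearn (`load_iris`) instead of the CSV file above, and `random_state=0` is an arbitrary choice made only so the result is repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Bundled iris data (assumed here to stand in for the CSV used in this tutorial)
iris = load_iris()

# Instantiate and fit, exactly as in Step3 and Step4
tree = DecisionTreeClassifier(random_state=0)
tree.fit(iris.data, iris.target)

# Text representation of the learned decision rules
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)

# Depth of the fitted tree (number of levels of splits)
print(tree.get_depth())
```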

## Step5: Testing (Prediction) Stage: Making prediction on new observations (Testing Data) using the trained model:
##### Now, suppose that we have a new observation (a new data sample) with known features [6, 3, 5.9, 2.9] and an unknown label. What would be our prediction for the label of this new observation?
#### Testing Stage is called Predict in sklearn
#### Method "predict" is used for many sklearn classes

In [None]:
# We can use the method "predict" of the *trained* object my_decisiontree on one or more testing data samples to perform prediction:

X_Testing = [[6, 3, 5.9, 2.9]]

y_predict = my_decisiontree.predict(X_Testing)

print(y_predict)

In [None]:
# We can use the method "predict" of the *trained* object my_decisiontree on one or more testing data samples to perform prediction:
# Two new data samples:

X_Testing = [[6, 3, 5.9, 2.9],[3.2, 3, 1.9, 0.3]]

y_predict = my_decisiontree.predict(X_Testing)

print(y_predict)
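Because the model above was fitted on a pandas DataFrame, recent versions of sklearn may warn about missing feature names when `predict` receives a plain Python list. One way to avoid this, sketched below, is to put the new samples into a DataFrame with the same column names. This sketch assumes the bundled iris data (`load_iris`) as a stand-in for the CSV file, so its labels are the integers 0, 1, 2 rather than species names.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

feature_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

# Bundled iris data with the tutorial's column names attached
iris = load_iris()
X = pd.DataFrame(iris.data, columns=feature_cols)
y = iris.target  # integer labels 0, 1, 2

model = DecisionTreeClassifier(random_state=0).fit(X, y)

# New samples as a DataFrame with the same columns used for training
X_new = pd.DataFrame([[6, 3, 5.9, 2.9], [3.2, 3, 1.9, 0.3]], columns=feature_cols)

y_new = model.predict(X_new)
print(y_new)
```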

# Evaluating the accuracy of our classifier:

#### 1- Let's split the iris dataset RANDOMLY into two new datasets: Training Set (e.g. 70% of the dataset) and Testing Set (30% of the dataset).
#### 2- Let's pretend that we do NOT know the label of the Testing Set!
#### 3- Let's Train the model on only Training Set, and then Predict on the Testing Set!
#### 4- After prediction, we can compare the "predicted labels" for the Testing Set with its "actual labels" to evaluate the accuracy of our Decision Tree Classifier!

#### We will learn more about model and accuracy evaluation in future tutorials!

In [None]:
# Randomly splitting the original dataset into a training set and a testing set
# The function "train_test_split" from the "sklearn.model_selection" module performs random splitting.
# "test_size=0.3" means that 30% of the data samples are picked for the testing set, and the rest (70%) for the training set.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
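The `random_state=1` argument fixes the seed of the random shuffling, so the same split is produced every time the cell runs. A small sketch of this behavior on toy arrays (`X_demo` and `y_demo` are made-up data used only for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each (illustrative values only)
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two splits with the same random_state
split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

# The testing features (second returned item) are identical in both splits
same_test_rows = np.array_equal(split_a[1], split_b[1])
print(same_test_rows)

# Training and testing sets together cover all samples
print(len(split_a[0]) + len(split_a[1]))
```

Changing `random_state` to another integer, or leaving it out, would generally give a different split.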

In [None]:
# print the size of the training set:
print(X_train.shape)
print(y_train.shape)


In [None]:
# print the size of the testing set:
print(X_test.shape)
print(y_test.shape)


In [None]:
print(X_test)
print('\n')
print(y_test)

### Training ONLY on the training set:

In [None]:
# Training ONLY on the training set:

my_decisiontree.fit(X_train, y_train)


### Testing on the testing set:

In [None]:
# Testing on the testing set:

y_predict = my_decisiontree.predict(X_test)

print(y_predict)

# Accuracy Evaluation:
#### After prediction, we can now compare the "predicted labels" for the Testing Set with its "actual labels" to evaluate the accuracy of our Decision Tree Classifier!

In [None]:
# Function "accuracy_score" from "sklearn.metrics" performs an element-by-element comparison and returns the 
# fraction of correct predictions (a value between 0 and 1):

from sklearn.metrics import accuracy_score

# Example:
y_pred    = [0, 2, 1, 1]
y_actual  = [0, 1, 2, 1]

score = accuracy_score(y_actual, y_pred)

print(score)
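To see where this number comes from, the same fraction can be computed by hand. The sketch below uses plain Python on the same two example lists: it counts the positions where the predicted and actual labels agree and divides by the number of samples.

```python
# Same example lists as above
y_pred   = [0, 2, 1, 1]
y_actual = [0, 1, 2, 1]

# True where the prediction matches the actual label
matches = [p == a for p, a in zip(y_pred, y_actual)]

# Fraction of correct predictions: 2 matches out of 4 samples
manual_score = sum(matches) / len(matches)

print(manual_score)  # 0.5
```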

In [None]:
# We can now compare the "predicted labels" for the Testing Set with its "actual labels" to evaluate the accuracy.
# Function "accuracy_score" from "sklearn.metrics" performs the element-by-element comparison and returns the 
# fraction of correct predictions:

from sklearn.metrics import accuracy_score

score = accuracy_score(y_test, y_predict)

print(score)

### checking the results:

In [None]:
results = pd.DataFrame()

results['actual'] = y_test 
results['prediction'] = y_predict 

print(results)

In [None]:
# How about using only two features rather than all 4 for classification?
# Try this:
# feature_cols = ['sepal_length','sepal_width']
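One possible way to run this experiment is sketched below. It repeats the split, fit, predict, and score steps for a chosen list of feature columns. The helper name `holdout_accuracy` is hypothetical, and the sketch assumes the bundled iris data (`load_iris`) as a stand-in for the CSV file, so the exact scores may differ from those obtained in the cells above.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

all_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

# Bundled iris data with the tutorial's column names attached
iris = load_iris()
X_all = pd.DataFrame(iris.data, columns=all_cols)
y = iris.target

def holdout_accuracy(cols):
    """Train on 70% of the data using only `cols`; return accuracy on the other 30%."""
    X_train, X_test, y_train, y_test = train_test_split(
        X_all[cols], y, test_size=0.3, random_state=1)
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))

acc_two = holdout_accuracy(['sepal_length', 'sepal_width'])
acc_four = holdout_accuracy(all_cols)

print('two features :', acc_two)
print('four features:', acc_four)
```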
