# **`scikit-learn`** Decision Tree Model

The **`DecisionTreeClassifier`** in `scikit-learn` is a class within the `sklearn.tree` module that implements a decision tree algorithm for classification tasks.

It is used to build a model that predicts the class label of a target variable based on input features.

### How it works:

*  It constructs a tree-like model of decisions, where each internal node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label.
* The algorithm recursively splits the data based on features that best separate the classes, aiming to minimize impurity (e.g., using Gini impurity or entropy).
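
To make the impurity idea concrete, here is a minimal sketch of how Gini impurity and entropy could be computed for a node and for a candidate split. The helper names (`gini_impurity`, `entropy`, `weighted_impurity`) and the toy labels are illustrative assumptions, not part of `scikit-learn`'s API.

```python
from collections import Counter
from math import log2


def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_k ** 2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def weighted_impurity(children, measure):
    """Average impurity of the child nodes, weighted by their sizes."""
    total = sum(len(child) for child in children)
    return sum(len(child) / total * measure(child) for child in children)


# hypothetical node holding 4 positive and 4 negative reviews
parent = ['positive'] * 4 + ['negative'] * 4

# hypothetical split that separates the two classes fairly well
left = ['positive'] * 3 + ['negative'] * 1
right = ['positive'] * 1 + ['negative'] * 3

print(gini_impurity(parent))                            # 0.5
print(weighted_impurity([left, right], gini_impurity))  # 0.375
print(entropy(parent))                                  # 1.0
```

The split lowers the weighted Gini impurity from 0.5 to 0.375, which is the kind of improvement the algorithm looks for when it chooses a feature and threshold.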

# Create Train Dataset and Test Dataset

## Load Dataset

We will use a dataset of movie reviews.

This dataset is used for binary sentiment classification.

The dataset contains two columns: "review" and "sentiment".

Each value in the "sentiment" column is either "positive" or "negative".

In [None]:
import pandas as pd

# csv file location
url = 'https://github.com/tariqzahratahdi/MachineLearning/raw/refs/heads/main/datasets/movies_reviews.csv'

# making dataframe from csv file
data = pd.read_csv(url)

# show dataframe
data

Unnamed: 0,review,sentiment
0,Interesting and short television movie describ...,negative
1,Insignificant and low-brained (haha!) 80's hor...,negative
2,"Ingrid Bergman, playing dentist Walter Matthau...",positive
3,Infamous horror films seldom measure up the hy...,negative
4,Independent film that would make Hollywood pro...,negative
...,...,...
1995,You remember the Spice Girls movie and how bad...,negative
1996,You should never ever even consider to watch t...,negative
1997,You wear only the best Italian suits from Arma...,positive
1998,You'd think you're in for some serious sightse...,positive


## Check Dataset is Balanced

Check that the "sentiment" column contains the same number of "positive" rows as "negative" rows.

In [None]:
data.value_counts('sentiment')

sentiment,count
negative,1000
positive,1000


## Split Data into Train and Test

### Import Libraries

In [None]:
# import sklearn train_test_split
from sklearn.model_selection import train_test_split

### Create Train and Test Dataframes


In [None]:
# create train and test dataframes
data_train, data_test = train_test_split(data, test_size=0.25, random_state=42)

# show train dataframe
data_train

Unnamed: 0,review,sentiment
1738,"This movie is very violent, yet exciting with ...",negative
548,Gilmore Girls is a hilarious show with never e...,positive
936,1 thing. this movie sucks BIG TIME..i was into...,negative
1389,The above line sums it up pretty good. The bes...,positive
1607,"This is a great ""small"" film. I say ""small"" be...",positive
...,...,...
1130,Must have to agree with the other reviewer. Th...,negative
1294,"Resnais, wow! The genius who brought us Hirosh...",negative
860,Absolutely nothing is redeeming about this tot...,negative
1459,"The movie with its single set, minimal cast, a...",positive
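
As a rough illustration of what `test_size=0.25` does, the sketch below splits a small made-up dataframe (the `toy` data is an assumption, not the movie review dataset). It also shows the optional `stratify` argument of `train_test_split`, which is not used in this notebook but can keep the class proportions equal in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical balanced toy dataset with 8 rows
toy = pd.DataFrame({
    'review': [f'review {i}' for i in range(8)],
    'sentiment': ['positive', 'negative'] * 4,
})

# 25% of the rows go to the test set, the rest to the train set
toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=42)
print(len(toy_train), len(toy_test))  # 6 2

# optional: stratify on the label so both parts keep the 50/50 balance
toy_train_s, toy_test_s = train_test_split(
    toy, test_size=0.25, random_state=42, stratify=toy['sentiment']
)
print(toy_test_s['sentiment'].value_counts())
```

With the 2000-row dataset above, the same setting yields 1500 training rows and 500 test rows.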


## Set Predictor Variable and Response Variable

In [None]:
# set predictor variable and response variable
X_train, y_train = data_train['review'], data_train['sentiment']
X_test, y_test = data_test['review'], data_test['sentiment']

# show predictor variable dataframe
X_train

Unnamed: 0,review
1738,"This movie is very violent, yet exciting with ..."
548,Gilmore Girls is a hilarious show with never e...
936,1 thing. this movie sucks BIG TIME..i was into...
1389,The above line sums it up pretty good. The bes...
1607,"This is a great ""small"" film. I say ""small"" be..."
...,...
1130,Must have to agree with the other reviewer. Th...
1294,"Resnais, wow! The genius who brought us Hirosh..."
860,Absolutely nothing is redeeming about this tot...
1459,"The movie with its single set, minimal cast, a..."


# Turn Text Data into Numerical Vectors

## Create an Instance of `TfidfVectorizer`

### Import Library and Create an Instance of `TfidfVectorizer`

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# create an instance of TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')

### Transform Text into Sparse Matrix

We first use the `tfidf.fit_transform()` method to learn the vocabulary and weights from the train data and transform that data into a sparse matrix.

After that, we use the `tfidf.transform()` method to transform the test data into a sparse matrix with the same columns.

In [None]:
# transform train text into sparse matrix
X_train_vector = tfidf.fit_transform(X_train)

# transform test text into sparse matrix
X_test_vector = tfidf.transform(X_test)  # use transform() instead of fit_transform()

X_train_vector

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 132712 stored elements and shape (1500, 21769)>
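
The difference between `fit_transform()` and `transform()` can be seen on a tiny corpus. The sketch below uses made-up sentences and hypothetical variable names (`toy_tfidf`, `toy_train_docs`, `toy_test_docs`). It is meant only to show that the vocabulary is fixed by the train data and that words seen only in the test data are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical toy corpora
toy_train_docs = ['great movie great cast', 'terrible movie boring plot']
toy_test_docs = ['great plot but awful soundtrack']

toy_tfidf = TfidfVectorizer(stop_words='english')

# fit_transform() learns the vocabulary from the train documents
toy_train_matrix = toy_tfidf.fit_transform(toy_train_docs)

# transform() reuses that vocabulary and does not learn new words
toy_test_matrix = toy_tfidf.transform(toy_test_docs)

# both matrices have one column per word in the train vocabulary
print(toy_train_matrix.shape, toy_test_matrix.shape)

# words that appear only in the test document are not in the vocabulary
print('soundtrack' in toy_tfidf.vocabulary_)  # False
```

Calling `fit_transform()` on the test data instead would build a different vocabulary, so the columns would no longer line up with what the classifier was trained on.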

# The Classification Model **`DecisionTreeClassifier`**

The **`DecisionTreeClassifier`** in `scikit-learn` is a class within the `sklearn.tree` module that implements a decision tree algorithm for classification tasks.

It is used to build a model that predicts the class label of a target variable based on input features.

1. **How it works:**<br>
    *  It constructs a tree-like model of decisions, where each internal node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label.
    * The algorithm recursively splits the data based on features that best separate the classes, aiming to minimize impurity (e.g., using Gini impurity or entropy).
2. **Key Parameters:**<br>
    * `criterion`: Specifies the function to measure the quality of a split. Options include "gini" for Gini impurity (default) and "entropy" (or the equivalent "log_loss") for information gain.
    * `splitter`: Determines the strategy used to choose the split at each node. Options are "best" (default) to choose the best split and "random" to choose the best random split.
    * `max_depth`: Sets the maximum depth of the tree, which can help prevent overfitting.
    * `min_samples_split`: The minimum number of samples required to split an internal node.
    * `min_samples_leaf`: The minimum number of samples required to be at a leaf node.
    * `max_features`: The number of features to consider when looking for the best split.
    * `ccp_alpha`: Complexity parameter for Minimal Cost-Complexity Pruning, used to control the size of the tree and prevent overfitting.
3. **Usage:**<br>
    * **Import:** Import the `DecisionTreeClassifier` from `sklearn.tree`.
    * **Instantiate:** Create an instance of the `DecisionTreeClassifier` with desired parameters.
    * **Train:** Fit the model to your training data using the `fit()` method.
    * **Predict:** Make predictions on new data using the `predict()` method.
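
To get a feel for how `max_depth` and `ccp_alpha` limit tree size, the following sketch fits three trees on synthetic data from `make_classification`. The dataset and the particular parameter values (`max_depth=3`, `ccp_alpha=0.02`) are arbitrary assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# synthetic binary classification data (not the movie review dataset)
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=42)

# unrestricted tree: grows until every leaf is pure
full_tree = DecisionTreeClassifier(random_state=42).fit(X_toy, y_toy)

# tree limited to at most 3 levels of splits
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_toy, y_toy)

# tree pruned after growing with minimal cost-complexity pruning
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.02, random_state=42).fit(X_toy, y_toy)

for name, tree in [('full', full_tree), ('shallow', shallow_tree), ('pruned', pruned_tree)]:
    print(name, 'depth:', tree.get_depth(), 'leaves:', tree.get_n_leaves())
```

Smaller trees usually fit the training data less closely, which is the trade-off these parameters are meant to control.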

## Create an Instance of `DecisionTreeClassifier`

In [None]:
# import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

# create an instance of DecisionTreeClassifier
clf = DecisionTreeClassifier(max_depth=5, random_state=42)

## Train the Model

In [None]:
# train the model
clf.fit(X_train_vector, y_train)

## Make Predictions

Once the model is trained, we can use it to make predictions with unseen data.

#### Example

Make a prediction on unseen data:

In [None]:
# make prediction
clf.predict(tfidf.transform(['a good movie']))

array(['positive'], dtype=object)

In [None]:
# make prediction
clf.predict(tfidf.transform(['a bad movie']))

array(['negative'], dtype=object)
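
As an optional alternative to calling `tfidf.transform()` by hand before every prediction, the vectorizer and the classifier can be chained with `make_pipeline` from `sklearn.pipeline`. This is a sketch on a made-up corpus. The notebook itself does not use a pipeline, and the names `toy_model` and `toy_reviews` are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# hypothetical toy training data
toy_reviews = ['good movie', 'good acting', 'bad movie', 'bad acting']
toy_sentiments = ['positive', 'positive', 'negative', 'negative']

# the pipeline vectorizes the text and then feeds it to the tree
toy_model = make_pipeline(
    TfidfVectorizer(),
    DecisionTreeClassifier(random_state=42),
)
toy_model.fit(toy_reviews, toy_sentiments)

# raw text goes in directly; the pipeline applies transform() internally
toy_result = toy_model.predict(['a good film'])
print(toy_result)
```

The design benefit is that the fitted vocabulary and the fitted tree travel together, so the test data cannot accidentally be vectorized with a different vocabulary.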

### Make Prediction with Test Data

Once the model is trained, we can use it to make predictions with the test data.

In [None]:
# predict on the test vectors computed earlier
y_predict = clf.predict(X_test_vector)

#### Show Actual Data Versus Predicted Data

In [None]:
# create a dataframe containing the predictions alongside the actual values
# (y_test is converted to a plain array so that rows are numbered from 0)
prediction = pd.DataFrame({'actual': y_test.to_numpy(), 'predicted': y_predict})

print(prediction.head(10))

     actual predicted
0  negative  negative
1  positive  positive
2  negative  positive
3  positive  positive
4  negative  positive
5  negative  positive
6  positive  positive
7  negative  negative
8  negative  negative
9  negative  negative


#### Calculate Successful Predictions Ratio

In [None]:
# extract successful predictions
prediction_success = prediction[prediction['actual']
                                == prediction['predicted']]

# calculate successful predictions ratio
prediction_success.shape[0] / prediction.shape[0]

0.714
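
The ratio of successful predictions is what `scikit-learn` calls accuracy. As a small sketch on made-up labels (the `toy_prediction` dataframe is an assumption), the manual calculation and `accuracy_score` from `sklearn.metrics` give the same number.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# hypothetical actual and predicted labels
toy_prediction = pd.DataFrame({
    'actual':    ['negative', 'positive', 'negative', 'positive'],
    'predicted': ['negative', 'positive', 'positive', 'positive'],
})

# manual ratio: fraction of rows where the prediction matches the actual label
manual_ratio = (toy_prediction['actual'] == toy_prediction['predicted']).mean()

# the same quantity computed by scikit-learn
sk_accuracy = accuracy_score(toy_prediction['actual'], toy_prediction['predicted'])

print(manual_ratio, sk_accuracy)  # 0.75 0.75
```

On the notebook's own data, `accuracy_score(y_test, y_predict)` should reproduce the ratio computed above.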