# Decision Tree in Machine Learning

## 1. Introduction to Decision Trees
A Decision Tree is a supervised learning algorithm used for both classification and regression tasks. It models decisions and their possible consequences in a tree-like structure, comprising nodes, branches, and leaves. Each internal node represents a "test" on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label or a continuous value.


![Decision tree diagram](Decision_Tree.png)

## 2. Terminologies

- **Root Node**: The topmost node in a decision tree, representing the entire dataset.
- **Internal Nodes**: Nodes that represent decision points based on the attributes of the data.
- **Branches**: The connections between nodes, representing the outcome of a decision or test.
- **Leaf Nodes**: The terminal nodes that represent the final output or class label.
- **Splitting**: The process of dividing a node into two or more sub-nodes based on certain criteria.
- **Pruning**: The process of removing sub-nodes from a decision tree to reduce its complexity and improve generalization.
- **Impurity**: A measure of the disorder or uncertainty in a dataset (e.g., Gini impurity, entropy).
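
To make the impurity term concrete, here is a minimal sketch of the two measures named above, written in plain Python rather than taken from any library. The function names `gini` and `entropy` and the toy label lists are illustrative assumptions.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: sum of p * log2(1 / p) over the class proportions p."""
    n = len(labels)
    return sum((count / n) * log2(n / count) for count in Counter(labels).values())


pure = ["yes", "yes", "yes", "yes"]   # a single class: no uncertainty
mixed = ["yes", "yes", "no", "no"]    # an even two-class mix: maximum uncertainty

print(gini(pure), entropy(pure))      # both measures are 0 for a pure node
print(gini(mixed), entropy(mixed))    # 0.5 and 1.0 for an even two-class mix
```

Both measures are zero for a pure node and grow as the classes become more evenly mixed, which is why a split that lowers them is considered a good split.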

## 3. How Is a Decision Tree Formed?

### Step-by-Step Process

1. **Selecting the Best Attribute**: The process begins with selecting the best attribute to split the dataset. This selection is based on criteria such as Gini impurity, Information Gain (using entropy), or Mean Squared Error (for regression tasks).

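2. **Splitting the Data**: Once the best attribute is selected, the data is split into subsets based on the attribute's values: one subset per category for a categorical attribute, or the two sides of a threshold for a numerical attribute.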

3. **Recursive Partitioning**: This process of selecting the best attribute and splitting the data continues recursively for each subset until a stopping condition is met (e.g., all samples belong to the same class, a maximum tree depth is reached, or no further information gain can be achieved).

4. **Assigning Class Labels**: At the leaf nodes, a class label (for classification) or a continuous value (for regression) is assigned based on the majority class or average of the values, respectively.
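
The four steps above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not scikit-learn's implementation: it handles numerical features only, uses weighted Gini impurity as the criterion, and the names `best_split`, `build_tree`, `predict` and the toy data are all hypothetical.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Step 1: return the (feature, threshold) with the lowest weighted Gini, or None."""
    best, best_score = None, gini(labels)  # a split must improve on the unsplit node
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, threshold), score
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    # Stopping conditions: depth limit reached, or no split reduces impurity
    # (which includes the case where the node is already pure).
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # Step 4: majority class at the leaf
    f, t = split
    # Step 2: split the data on the chosen feature and threshold.
    left = [i for i, row in enumerate(rows) if row[f] <= t]
    right = [i for i, row in enumerate(rows) if row[f] > t]
    # Step 3: recurse on each subset.
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([rows[i] for i in left], [labels[i] for i in left], depth + 1, max_depth),
        "right": build_tree([rows[i] for i in right], [labels[i] for i in right], depth + 1, max_depth),
    }


def predict(tree, row):
    """Follow the tests from the root until a leaf (a plain label) is reached."""
    while isinstance(tree, dict):
        tree = tree["left"] if row[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree


# Made-up toy data: [hours_studied, hours_slept] -> exam result.
rows = [[1, 5], [2, 8], [3, 6], [6, 4], [7, 8], [8, 7]]
labels = ["fail", "fail", "fail", "pass", "pass", "pass"]

tree = build_tree(rows, labels)
print(tree)
print(predict(tree, [5, 6]))
```

On this toy data a single test on the first feature separates the classes, so the recursion stops after one split because both children are pure.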


## 4. Why Use a Decision Tree?

- **Easy to Understand and Interpret**: Decision trees are simple to visualize and explain, making them accessible even to non-experts.
- **No Need for Data Normalization**: Decision trees do not require feature scaling or normalization.
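- **Handles Both Numerical and Categorical Data**: Decision trees can work with various types of data without extensive preprocessing, although some implementations (including scikit-learn's) expect categorical features to be encoded as numbers first.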
- **Feature Importance**: They provide insights into the importance of different features in the dataset.
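
The claim about normalization can be checked with a small experiment: because a tree only asks whether a value falls on one side of a threshold, any order-preserving rescaling leaves the chosen partition unchanged. The sketch below is plain Python with made-up values; `best_left_group` is a hypothetical helper, not a library function.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_left_group(values, labels):
    """Row indices sent left (value <= threshold) by the threshold with the lowest weighted Gini."""
    best_left, best_score = None, float("inf")
    for threshold in sorted(set(values))[:-1]:  # the largest value would leave the right side empty
        left = [i for i, v in enumerate(values) if v <= threshold]
        right = [i for i, v in enumerate(values) if v > threshold]
        score = (len(left) * gini([labels[i] for i in left])
                 + len(right) * gini([labels[i] for i in right])) / len(values)
        if score < best_score:
            best_left, best_score = left, score
    return best_left


fares = [7.0, 8.0, 13.0, 26.0, 53.0, 71.0]  # made-up values on their raw scale
labels = [0, 0, 0, 1, 1, 1]

lo, hi = min(fares), max(fares)
scaled = [(f - lo) / (hi - lo) for f in fares]  # min-max scaling to [0, 1]

print(best_left_group(fares, labels))
print(best_left_group(scaled, labels))
```

The threshold value itself changes with the scale, but the rows that end up on each side do not, so the resulting tree makes the same predictions.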

## 5. Advantages and Disadvantages of Decision Trees

### Advantages

- **Easy to Understand and Interpret**: The tree structure is simple to visualize and explain.
- **Handles Both Numerical and Categorical Data**: Decision trees can work with different types of data.
- **No Need for Data Normalization**: Decision trees do not require feature scaling or normalization.
- **Feature Importance**: Provides insights into the importance of different features.

### Disadvantages

- **Overfitting**: Decision trees can become overly complex and model noise in the data, leading to poor generalization.
- **Instability**: Small changes in the data can result in a completely different tree structure.
- **Bias towards Certain Features**: Decision trees can be biased towards features with many distinct levels, because such features offer more ways to split the data and can look informative by chance.
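
The bias towards many-level features is easiest to see with information gain and one branch per distinct value (the multiway style of split, which differs from the binary splits scikit-learn uses). The following is a sketch with made-up data: `row_id` and `weather` are hypothetical features, and `information_gain` is written here for illustration.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Parent entropy minus the weighted entropy of one child per distinct feature value."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


labels = ["yes", "no", "yes", "no", "yes", "no", "yes", "no"]
row_id = list(range(8))  # unique per row: many levels, no real signal
weather = ["sun", "sun", "sun", "rain", "sun", "rain", "rain", "rain"]  # two levels, some signal

print(information_gain(row_id, labels))   # every child holds one row, so the gain is maximal
print(information_gain(weather, labels))  # a smaller gain, despite the genuine association
```

An identifier-like column wins purely because each of its values isolates a single row, which is one reason columns such as `PassengerId` are usually dropped before training.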

## 6. Building a Decision Tree Classification Model: Titanic Survival Classification

![Alt Text](image_1.png)

### Step 1: Import Libraries

In [1]:
import pandas as pd

### Step 2: Load and Explore the Dataset

In [3]:
# Load the Titanic dataset
df = pd.read_csv('Titanic-Dataset.csv')
# Display the first few rows of the dataset
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [5]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [6]:
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [7]:
# Drop columns that are not needed
df = df.drop(columns=['PassengerId', 'Name', 'SibSp', 'Parch', 'Ticket', 'Cabin'])

In [8]:
# Fill missing Age values with the median age
df['Age'] = df['Age'].fillna(df['Age'].median())
df.isnull().sum()

Survived    0
Pclass      0
Sex         0
Age         0
Fare        0
Embarked    2
dtype: int64

In [9]:
# Drop the two rows where Embarked is missing
df = df.dropna(subset=['Embarked'])
df.isnull().sum()

Survived    0
Pclass      0
Sex         0
Age         0
Fare        0
Embarked    0
dtype: int64

### Step 3: Visualize the Data

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.countplot(data=df, x='Survived')
plt.show()

In [None]:
sns.histplot(df['Age'].dropna(), kde=True)
plt.title('Age Distribution')
plt.show()

### Step 4: Preprocess the Data

In [None]:
from sklearn.preprocessing import LabelEncoder

# Convert categorical variables into numeric codes.
# LabelEncoder assigns codes in sorted order: female=0, male=1 and C=0, Q=1, S=2.
label_encoder = LabelEncoder()
df['Sex'] = label_encoder.fit_transform(df['Sex'])
df['Embarked'] = label_encoder.fit_transform(df['Embarked'])
df.head()
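
The integer codes matter later, when values such as `sex=1` are passed to the model by hand. `LabelEncoder` numbers the distinct values in sorted order; the sketch below reproduces that mapping in plain Python so the codes can be inspected. `label_mapping` is a hypothetical helper that mimics the behaviour, not part of scikit-learn.

```python
def label_mapping(values):
    """Map each distinct value to its position in sorted order."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


print(label_mapping(["male", "female", "female", "male"]))
print(label_mapping(["S", "C", "S", "Q"]))
```

With the fitted encoder itself, the same information is available from its `classes_` attribute, though only for the column it was most recently fitted on, because the notebook reuses one encoder for both columns.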

In [None]:
# Split the data into features and target variable
X = df.drop('Survived', axis=1)
y = df['Survived']

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Step 5: Train the Decision Tree Model

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Create and train the Decision Tree classifier
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

In [None]:
# Make predictions on the test set
y_pred = model.predict(X_test)

### Step 6: Evaluate the Model

In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

In [None]:
# Print classification report
print(classification_report(y_test, y_pred))

# Plot the confusion matrix as a heatmap
confusion = confusion_matrix(y_test, y_pred)
sns.heatmap(confusion, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
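
To connect the confusion matrix with the figures in the classification report, the main metrics can be derived by hand from its four cells. The counts below are hypothetical and chosen only for easy arithmetic; they are not results from this notebook. The layout follows scikit-learn's convention of rows for actual classes and columns for predicted classes.

```python
def binary_metrics(confusion):
    """Derive accuracy, precision, recall and F1 from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = confusion
    precision = tp / (tp + fp)  # of the predicted positives, how many were correct
    recall = tp / (tp + fn)     # of the actual positives, how many were found
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts, not output from the model above.
example_confusion = [[50, 10],
                     [5, 35]]

for name, value in binary_metrics(example_confusion).items():
    print(f"{name}: {value:.2f}")
```

Here the positive class is the second row and column, which corresponds to `Survived = 1` in this dataset.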

### Step 7: Visualize the Decision Tree

In [None]:
from sklearn.tree import plot_tree

# Plot the decision tree
plt.figure(figsize=(20,10))
# Pass feature names as a plain list; some scikit-learn versions reject a pandas Index here
plot_tree(model, feature_names=list(X.columns), class_names=['Not Survived', 'Survived'], filled=True)
plt.title('Decision Tree')
plt.show()


### Step 8: Example usage

In [None]:
def predict_survival(model, pclass, sex, age, fare, embarked):
    # Create a DataFrame for the input data
    example_data = pd.DataFrame({
        'Pclass': [pclass],
        'Sex': [sex],
        'Age': [age],
        'Fare': [fare],
        'Embarked': [embarked]
    })
    
    # Predict survival
    example_prediction = model.predict(example_data)
    
    # Return the result
    return 'Survived' if example_prediction[0] == 1 else 'Not Survived'


In [None]:
# Example usage of the function
result = predict_survival(
    model=model,
    pclass=3,
    sex=1,  # male
    age=22,
    fare=7.25,
    embarked=2  # S
)

print(f'Predicted survival: {result}')


### Conclusion

In this notebook, we performed survival classification on the Titanic dataset using a Decision Tree classifier. We explored the dataset, visualized the target and age distributions, preprocessed the data, trained the model, evaluated its performance, and visualized the fitted tree. The accuracy score, classification report, and confusion matrix show how well the model separates passengers who survived from those who did not.

This step-by-step guide provides a comprehensive introduction to Titanic dataset classification for beginners, covering all essential aspects from data exploration to model evaluation.
