
In [20]:
# For data manipulation and analysis
import pandas as pd

# For preprocessing data, including one-hot encoding, label encoding, and binning
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, KBinsDiscretizer

# For splitting the dataset into training and testing sets
from sklearn.model_selection import train_test_split

# For building the Naive Bayes model
from sklearn.naive_bayes import GaussianNB

# For evaluating the model's performance
from sklearn.metrics import accuracy_score, classification_report


# Step 2: Loading and Exploring the Dataset

In [3]:
cs_original = pd.read_csv("/content/Credit Score Classification Dataset.csv")

# Display basic information about the dataset

In [4]:
cs_original.head()

   Age  Gender  Income            Education  Marital Status  Number of Children  Home Ownership  Credit Score
0   25  Female   50000    Bachelor's Degree          Single                   0          Rented          High
1   30    Male  100000      Master's Degree         Married                   2           Owned          High
2   35  Female   75000            Doctorate         Married                   1           Owned          High
3   40    Male  125000  High School Diploma          Single                   0           Owned          High
4   45  Female  100000    Bachelor's Degree         Married                   3           Owned          High


Information About the Dataset

This dataset contains information about a sample of 164 people from around the world. It includes the following columns:

1. Age
2. Gender
3. Income
4. Education
5. Marital Status
6. Number of Children
7. Home Ownership
8. Credit Score (the target variable)


# Use .info() to display a summary of the DataFrame.

In [21]:
print("Dataset Info:\n")
cs_original.info()

Dataset Info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 164 entries, 0 to 163
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Age                 164 non-null    int64 
 1   Gender              164 non-null    object
 2   Income              164 non-null    int64 
 3   Education           164 non-null    object
 4   Marital Status      164 non-null    object
 5   Number of Children  164 non-null    int64 
 6   Home Ownership      164 non-null    object
 7   Credit Score        164 non-null    object
dtypes: int64(3), object(5)
memory usage: 10.4+ KB


The dataset has no null values. It contains 3 integer features and 5 object-type features.
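
The split between numeric and object columns can also be obtained programmatically instead of being read off the `.info()` output. The following is a minimal sketch, assuming a small stand-in DataFrame (`demo`, a hypothetical name) built from the first rows shown by `.head()`; on the real data the same calls would be applied to `cs_original`.

```python
import pandas as pd

# Hypothetical stand-in for cs_original, built from the first rows shown by .head()
demo = pd.DataFrame({
    "Age": [25, 30, 35],
    "Gender": ["Female", "Male", "Female"],
    "Income": [50000, 100000, 75000],
    "Education": ["Bachelor's Degree", "Master's Degree", "Doctorate"],
    "Marital Status": ["Single", "Married", "Married"],
    "Number of Children": [0, 2, 1],
    "Home Ownership": ["Rented", "Owned", "Owned"],
    "Credit Score": ["High", "High", "High"],
})

# Numeric columns can be used directly; the rest need encoding before modelling
numeric_cols = demo.select_dtypes(include="number").columns.tolist()
categorical_cols = demo.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```

Keeping the two lists around makes it easier to apply encoders only to the columns that need them.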

In [6]:
# Display number of rows, number of columns.
cs_original.shape

(164, 8)

In [7]:
# Display all column names.
cs_original.columns

Index(['Age', 'Gender', 'Income', 'Education', 'Marital Status',
       'Number of Children', 'Home Ownership', 'Credit Score'],
      dtype='object')

In [8]:
# Display the number of missing values in each column:
# isnull() checks whether each value is missing,
# sum() aggregates the number of missing values per column.
cs_original.isnull().sum()

Age                   0
Gender                0
Income                0
Education             0
Marital Status        0
Number of Children    0
Home Ownership        0
Credit Score          0
dtype: int64

In [22]:
print("\nDescriptive Statistics:\n", cs_original.describe(include='all'))


Descriptive Statistics:
                Age  Gender         Income          Education Marital Status  \
count   164.000000     164     164.000000                164            164   
unique         NaN       2            NaN                  5              2   
top            NaN  Female            NaN  Bachelor's Degree        Married   
freq           NaN      86            NaN                 42             87   
mean     37.975610     NaN   83765.243902                NaN            NaN   
std       8.477289     NaN   32457.306728                NaN            NaN   
min      25.000000     NaN   25000.000000                NaN            NaN   
25%      30.750000     NaN   57500.000000                NaN            NaN   
50%      37.000000     NaN   83750.000000                NaN            NaN   
75%      45.000000     NaN  105000.000000                NaN            NaN   
max      53.000000     NaN  162500.000000                NaN            NaN   

        Number of Childre

# Summary of the descriptive statistics for this dataset:

**Age:**

The mean age is approximately 38 years, with a standard deviation of about 8.5 years.

The youngest individual is 25 years old, and the oldest is 53 years old.

**Gender:**

There are two unique genders in the dataset: Male and Female.
The frequency of females is slightly higher, with 86 occurrences out of 164.

**Income:**

The mean income is approximately $83,765, with a standard deviation of about $32,457.
Incomes range from $25,000 to $162,500.

**Education:**

There are five unique levels of education: High School Diploma, Associate's Degree, Bachelor's Degree, Master's Degree, and Doctorate.
The most common education level is Bachelor's Degree, occurring 42 times out of 164.

**Marital Status:**

There are two unique marital statuses: Single and Married.
The most common status is Married, with 87 occurrences out of 164.

**Number of Children:**

The average number of children is approximately 0.65, with a standard deviation of about 0.88.
The range of the number of children is from 0 to 3.

**Home Ownership:**

There are two unique categories for home ownership: Rented and Owned.
The most common category is Owned, occurring 111 times out of 164.

**Credit Score:**

There are three unique categories for credit scores: Low, Average, and High.
The most common credit score category is High, occurring 113 times out of 164.

***Verifying Unique Values of the Categorical Variables in the Dataset***

In [10]:
cs_original["Home Ownership"].unique()

array(['Rented', 'Owned'], dtype=object)

In [11]:
# Print unique values of Credit Score col
cs_original["Credit Score"].unique()

array(['High', 'Average', 'Low'], dtype=object)

In [12]:
cs_original['Education'].unique()

array(["Bachelor's Degree", "Master's Degree", 'Doctorate',
       'High School Diploma', "Associate's Degree"], dtype=object)

In [19]:
cs_original['Marital Status'].unique()

array(['Single', 'Married'], dtype=object)
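
The per-column checks above can be collapsed into a single loop that also reports how often each value occurs. This is a minimal sketch on a hypothetical stand-in DataFrame (`demo`) built from the first five rows shown earlier; the counts it prints are for those five rows only, not for the full dataset.

```python
import pandas as pd

# Hypothetical stand-in built from the first five rows of the dataset
demo = pd.DataFrame({
    "Home Ownership": ["Rented", "Owned", "Owned", "Owned", "Owned"],
    "Marital Status": ["Single", "Married", "Married", "Single", "Married"],
    "Credit Score": ["High", "High", "High", "High", "High"],
})

# Collect the unique values of every categorical column in one pass
unique_values = {col: demo[col].unique().tolist() for col in demo.columns}

# value_counts() additionally gives the frequency of each category
home_counts = demo["Home Ownership"].value_counts()

for col, values in unique_values.items():
    print(f"{col}: {values}")
print(home_counts)
```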

# **Preparing the dataset**

In [24]:
# There are no missing values, as shown by the dataset info above


#  Encoding categorical variables

# One-Hot Encoding for 'Home Ownership' (drop='first' keeps a single 0/1 column)
# Note: sparse_output requires scikit-learn >= 1.2; older versions use sparse=False
ohe = OneHotEncoder(sparse_output=False, drop='first')
home_ownership_encoded = ohe.fit_transform(cs_original[['Home Ownership']])

# Convert the One-Hot encoded column into a DataFrame
home_ownership_df = pd.DataFrame(home_ownership_encoded, columns=ohe.get_feature_names_out(['Home Ownership']))

# Concatenate the new column to the original DataFrame; 'Gender' is not used as a feature
cs = pd.concat([cs_original, home_ownership_df], axis=1).drop(['Gender', 'Home Ownership'], axis=1)

# Label encode 'Education'
# Note: LabelEncoder sorts classes alphabetically, so the codes do not follow education_order
education_order = ['High School Diploma', "Associate's Degree", "Bachelor's Degree", "Master's Degree", 'Doctorate']
le_education = LabelEncoder()
le_education.fit(education_order)
cs['Education'] = le_education.transform(cs['Education'])

# Label encode 'Marital Status' (alphabetical: Married=0, Single=1)
marital_status_order = ['Single', 'Married']
le_marital_status = LabelEncoder()
le_marital_status.fit(marital_status_order)
cs['Marital Status'] = le_marital_status.transform(cs['Marital Status'])

# Label encode 'Credit Score' (alphabetical: Average=0, High=1, Low=2)
le_credit_score = LabelEncoder()
cs['Credit Score'] = le_credit_score.fit_transform(cs['Credit Score'])

# Binning the encoded 'Credit Score' using KBinsDiscretizer
# With 4 uniform bins over [0, 2], the codes 0, 1, 2 fall into bins 0, 2, 3
discretizer = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='uniform')
cs['Credit Score Category'] = discretizer.fit_transform(cs[['Credit Score']]).astype(int)

# Mapping the bins to labels; as a result Average -> 'Low', High -> 'High', Low -> 'Excellent'
credit_score_labels = {0: 'Low', 1: 'Average', 2: 'High', 3: 'Excellent'}
cs['Credit Score Category'] = cs['Credit Score Category'].map(credit_score_labels)
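
`LabelEncoder` always sorts its classes alphabetically, so fitting it on `education_order` does not make the codes follow that order. If the intent is an ordinal encoding from High School Diploma up to Doctorate, one possible alternative (not what the cell above does) is `OrdinalEncoder` with an explicit `categories` list. The sketch below compares the two on the five education levels; `demo` is a hypothetical stand-in DataFrame.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

education_order = ['High School Diploma', "Associate's Degree", "Bachelor's Degree",
                   "Master's Degree", 'Doctorate']

# LabelEncoder ignores the order of the list it is fitted on and sorts alphabetically
le = LabelEncoder().fit(education_order)
label_codes = {str(c): int(code) for c, code in zip(le.classes_, le.transform(le.classes_))}

# OrdinalEncoder accepts an explicit category order through `categories`
demo = pd.DataFrame({'Education': education_order})  # hypothetical stand-in column
oe = OrdinalEncoder(categories=[education_order])
ordinal_codes = oe.fit_transform(demo[['Education']]).ravel().astype(int)
ordinal_map = dict(zip(education_order, ordinal_codes.tolist()))

print("LabelEncoder codes:  ", label_codes)
print("OrdinalEncoder codes:", ordinal_map)
```

Swapping the encoder changes the feature values the model sees, so the reported metrics would need to be recomputed after such a change.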





Splitting the Dataset

In [25]:
# Step 4: Splitting the Dataset
X = cs.drop(['Credit Score', 'Credit Score Category'], axis=1)
y = cs['Credit Score Category']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
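
With an imbalanced target and a small dataset, a purely random split can leave a minority class under-represented in the test set. One option, shown here as a sketch on hypothetical synthetic labels rather than on the notebook's data, is to pass `stratify` to `train_test_split` so that class proportions are preserved in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 100 'High', 30 'Average', 20 'Low'
y_demo = np.array(['High'] * 100 + ['Average'] * 30 + ['Low'] * 20)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# stratify=y_demo keeps the class proportions the same in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

labels, counts = np.unique(y_te, return_counts=True)
test_counts = {str(label): int(count) for label, count in zip(labels, counts)}
print(test_counts)
```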

In [26]:
# Step 5: Training the Naive Bayes Model
model = GaussianNB()
model.fit(X_train, y_train)

In [27]:
# Step 6: Evaluating the Model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print(f'Classification Report:\n{classification_rep}')

Accuracy: 1.0
Classification Report:
              precision    recall  f1-score   support

   Excellent       1.00      1.00      1.00         5
        High       1.00      1.00      1.00        23
         Low       1.00      1.00      1.00         5

    accuracy                           1.00        33
   macro avg       1.00      1.00      1.00        33
weighted avg       1.00      1.00      1.00        33
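
The class names in this report ('Excellent', 'High', 'Low') differ from the raw labels in the Credit Score column ('High', 'Average', 'Low'). The following sketch re-runs the same label encoding, binning, and mapping steps on just the three raw labels to trace where each one ends up. It uses the same parameters as the preparation cell; the variable names `original`, `codes`, `bins`, and `relabelled` are introduced here for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, KBinsDiscretizer

original = np.array(['High', 'Average', 'Low'])

# LabelEncoder assigns codes alphabetically: Average=0, High=1, Low=2
codes = LabelEncoder().fit_transform(original).astype(float)

# Four uniform bins over [0, 2] have edges 0, 0.5, 1.0, 1.5, 2.0
discretizer = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='uniform')
bins = discretizer.fit_transform(codes.reshape(-1, 1)).ravel().astype(int)

# Same bin-to-label mapping as in the preparation cell
credit_score_labels = {0: 'Low', 1: 'Average', 2: 'High', 3: 'Excellent'}
relabelled = {str(o): credit_score_labels[int(b)] for o, b in zip(original, bins)}
print(relabelled)
```

Because bin 1 is never occupied, no row receives the label 'Average', which is why that class is absent from the report.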



# **Conclusion**

The classification report and accuracy score support the following observations:

**Accuracy:** The Naive Bayes model achieved an accuracy of 100%, meaning it correctly classified all 33 instances in the test dataset.

**Precision, Recall, and F1-score:**

For each class in the report ('Excellent', 'High', 'Low'), precision, recall, and F1-score are all 1.00, so the model produced no false positives and no false negatives. Note that these class names come from the binning step, not from the raw data: the label-encoded values 0 (Average), 1 (High), and 2 (Low) fall into bins 0, 2, and 3, so the report's 'Low' corresponds to the original 'Average' and its 'Excellent' corresponds to the original 'Low'.

**Support:** The support column gives the number of instances of each class in the test dataset: 5, 23, and 5, for a total of 33 instances.
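
To make the precision, recall, and support columns concrete, the sketch below derives them from a confusion matrix. The labels are hypothetical and deliberately include one misclassification; they are not taken from the notebook's test set.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels with one error (a 'High' predicted as 'Low')
y_true = ['High', 'High', 'High', 'Low', 'Low', 'Average']
y_hat = ['High', 'High', 'Low', 'Low', 'Low', 'Average']

labels = ['Average', 'High', 'Low']
cm = confusion_matrix(y_true, y_hat, labels=labels)  # rows: true, columns: predicted

# Precision = TP / (TP + FP): diagonal divided by column sums
precision = cm.diagonal() / cm.sum(axis=0)
# Recall = TP / (TP + FN): diagonal divided by row sums
recall = cm.diagonal() / cm.sum(axis=1)
# Support = number of true instances per class: row sums
support = cm.sum(axis=1)

# Cross-check against scikit-learn's own per-class scores
sk_precision = precision_score(y_true, y_hat, labels=labels, average=None)
sk_recall = recall_score(y_true, y_hat, labels=labels, average=None)
print(np.allclose(precision, sk_precision), np.allclose(recall, sk_recall))
print(dict(zip(labels, support.tolist())))
```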

**Overall:**

The Naive Bayes model classified every instance in the test set correctly. This suggests that the features used for classification are highly predictive of the target variable (Credit Score Category), although the result rests on a small test set of 33 instances.