This Python notebook imports NumPy, pandas, seaborn, Matplotlib, and scikit-learn, and suppresses warnings with filterwarnings("ignore"). It also imports the decision tree and random forest classifiers from scikit-learn, along with train_test_split and accuracy_score.

In [47]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sb
import matplotlib.pyplot as plt
from warnings import filterwarnings
filterwarnings("ignore")
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

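This code reads the same CSV file, 'Training.csv', twice and assigns the results to two separate dataframes called 'train' and 'test'. It then binds the names 'A' and 'B' to those dataframes. Plain assignment does not copy anything, so 'A' and 'B' refer to the same objects as 'train' and 'test'. Because both dataframes come from the same file, 'test' is not an independent test set.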

In [48]:
train = pd.read_csv("/content/Training.csv")
test = pd.read_csv('/content/Training.csv')
A = train
B = test
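
The difference between binding a second name and making a copy can be sketched as follows. This is a minimal illustration on a made-up three-row frame (the data and the names `A_alias`, `A_copy` are hypothetical, not taken from Training.csv), showing that `A = train` shares the object while `train.copy()` does not.

```python
import pandas as pd

# Hypothetical toy frame standing in for the data loaded from Training.csv
train = pd.DataFrame({"itching": [1, 0, 1],
                      "prognosis": ["Acne", "Allergy", "Acne"]})

A_alias = train        # second name for the same object, nothing is copied
A_copy = train.copy()  # independent copy of the data

A_alias["flag"] = 0    # the new column also appears in `train`
A_copy["other"] = 1    # the new column stays local to `A_copy`

print(A_alias is train)          # identity check on the alias
print("flag" in train.columns)   # column added through the alias
print("other" in train.columns)  # column added to the copy only
```

If the intent is to modify 'A' without touching 'train', `train.copy()` is the usual choice.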

In [49]:
A.head()

Unnamed: 0,itching,skin_rash,nodal_skin_eruptions,continuous_sneezing,shivering,chills,joint_pain,stomach_pain,acidity,ulcers_on_tongue,...,scurring,skin_peeling,silver_like_dusting,small_dents_in_nails,inflammatory_nails,blister,red_sore_around_nose,yellow_crust_ooze,prognosis,Unnamed: 133
0,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,Fungal infection,
1,0,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,Fungal infection,
2,1,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,Fungal infection,
3,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,Fungal infection,
4,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,Fungal infection,


In [50]:
B.head()

Unnamed: 0,itching,skin_rash,nodal_skin_eruptions,continuous_sneezing,shivering,chills,joint_pain,stomach_pain,acidity,ulcers_on_tongue,...,scurring,skin_peeling,silver_like_dusting,small_dents_in_nails,inflammatory_nails,blister,red_sore_around_nose,yellow_crust_ooze,prognosis,Unnamed: 133
0,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,Fungal infection,
1,0,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,Fungal infection,
2,1,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,Fungal infection,
3,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,Fungal infection,
4,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,Fungal infection,


This code counts the number of occurrences of each unique value in the 'prognosis' column of the Pandas DataFrame 'A'. The 'prognosis' column is assumed to be a categorical variable.

In [51]:
A.prognosis.value_counts()

Fungal infection                           120
Hepatitis C                                120
Hepatitis E                                120
Alcoholic hepatitis                        120
Tuberculosis                               120
Common Cold                                120
Pneumonia                                  120
Dimorphic hemmorhoids(piles)               120
Heart attack                               120
Varicose veins                             120
Hypothyroidism                             120
Hyperthyroidism                            120
Hypoglycemia                               120
Osteoarthristis                            120
Arthritis                                  120
(vertigo) Paroymsal  Positional Vertigo    120
Acne                                       120
Urinary tract infection                    120
Psoriasis                                  120
Hepatitis D                                120
Hepatitis B                                120
Allergy      

This gives the number of missing values in each column of the dataframe A.

In [52]:
A.isna().sum()

itching                    0
skin_rash                  0
nodal_skin_eruptions       0
continuous_sneezing        0
shivering                  0
                        ... 
blister                    0
red_sore_around_nose       0
yellow_crust_ooze          0
prognosis                  0
Unnamed: 133            4920
Length: 134, dtype: int64

In [53]:
B.isna().sum()

itching                    0
skin_rash                  0
nodal_skin_eruptions       0
continuous_sneezing        0
shivering                  0
                        ... 
blister                    0
red_sore_around_nose       0
yellow_crust_ooze          0
prognosis                  0
Unnamed: 133            4920
Length: 134, dtype: int64

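So the symptom columns and 'prognosis' have no missing values. The one exception is 'Unnamed: 133', which is missing in 4920 rows and appears to hold no data at all; every row shown by head() ends with an empty field.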
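
The isna().sum() output above reports 4920 missing values for 'Unnamed: 133'. One way to handle such a column is to drop it explicitly before modelling. The sketch below assumes a small made-up frame with the same kind of empty column (the data and the names `df`, `empty_cols`, `cleaned` are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one column that is empty in every row
df = pd.DataFrame({
    "itching": [1, 0, 1],
    "prognosis": ["Acne", "Allergy", "Acne"],
    "Unnamed: 133": [np.nan, np.nan, np.nan],
})

# Columns whose missing count equals the number of rows are fully empty
missing = df.isna().sum()
empty_cols = missing[missing == len(df)].index.tolist()

# Drop every column that contains only missing values
cleaned = df.dropna(axis=1, how="all")

print(empty_cols)
print(cleaned.columns.tolist())
```

Dropping the column up front makes the cleanup visible in the notebook instead of leaving it to a later preprocessing step.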

Define X and Y - from training data and P for testing data

This code separates the target variable "prognosis" from the rest of the features in the dataset A and assigns them to Y and X variables, respectively. Y is a Pandas DataFrame consisting of a single column "prognosis", while X is a DataFrame consisting of all the columns in A except for "prognosis". This is a common data preprocessing step in machine learning tasks where we need to separate the target variable from the rest of the data before training a model.

In [54]:
Y = A[["prognosis"]]
X = A.drop(["prognosis"],axis=1)
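
Selecting with double brackets yields a one-column DataFrame, while single brackets yield a one-dimensional Series. scikit-learn estimators generally expect a one-dimensional target and emit a DataConversionWarning for a column vector, which the earlier filterwarnings("ignore") call would hide. A minimal sketch on made-up data (the frame `A_demo` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy frame standing in for dataframe A
A_demo = pd.DataFrame({
    "itching": [1, 0, 1],
    "skin_rash": [0, 1, 1],
    "prognosis": ["Acne", "Allergy", "Acne"],
})

Y_frame = A_demo[["prognosis"]]   # DataFrame with shape (3, 1)
Y_series = A_demo["prognosis"]    # Series with shape (3,)
X_demo = A_demo.drop(columns=["prognosis"])  # all feature columns

# A one-column DataFrame can be flattened to 1-D when an estimator needs it
Y_flat = Y_frame.values.ravel()

print(Y_frame.shape, Y_series.shape, Y_flat.shape)
print(X_demo.columns.tolist())
```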

training data splitting

Model 1 - Random Forest

This code uses SimpleImputer from the scikit-learn library to impute missing values in the input data, filling each gap with the column's most frequent value. Note that with scikit-learn's default settings, a column that is entirely missing at fit time, such as 'Unnamed: 133' here, is dropped by the imputer rather than filled. The imputer is fitted on the full dataset, and only afterwards is the data split into training and testing sets (80/20) using train_test_split. The RandomForestClassifier model is then trained on the training set and used to make predictions on both the training and testing sets. The accuracy score of the model is printed for both sets.

In [57]:
from sklearn.impute import SimpleImputer

# Create an imputer object
imp = SimpleImputer(strategy='most_frequent')

# Impute missing values
X = imp.fit_transform(X)

# Split the data into train and test sets
xtrain, xtest, ytrain, ytest = train_test_split(X, Y, test_size=0.2, random_state=42)

# Train the model
rfc = RandomForestClassifier(random_state=42)
model_rfc = rfc.fit(xtrain, ytrain)

# Make predictions
tr_pred_rfc = model_rfc.predict(xtrain)
ts_pred_rfc = model_rfc.predict(xtest)

# Evaluate the model
print("training accuracy is:", accuracy_score(ytrain, tr_pred_rfc))
print("testing accuracy is:", accuracy_score(ytest, ts_pred_rfc))


training accuracy is: 1.0
testing accuracy is: 1.0
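
The cell above fits the imputer on all rows before splitting. An alternative ordering, sketched here under stated assumptions, wraps the imputer and the classifier in a scikit-learn Pipeline so that the imputation statistics are learned from the training split only. The dataset, the column names, and the variables `X_demo`, `y_demo`, and `pipe` are made up for illustration and are not the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Hypothetical data: 3 classes, two indicator features, one noisy feature
y_demo = np.repeat(["Acne", "Allergy", "Arthritis"], 20)
X_demo = pd.DataFrame({
    "s_acne": (y_demo == "Acne").astype(float),
    "s_allergy": (y_demo == "Allergy").astype(float),
    "s_noise": rng.integers(0, 2, size=60).astype(float),
})
X_demo.iloc[::10, 2] = np.nan  # inject a few missing values into s_noise

# Split first, so the test rows never influence the imputer
xtr, xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Imputer and classifier are fitted together on the training split only
pipe = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    RandomForestClassifier(n_estimators=20, random_state=42),
)
pipe.fit(xtr, ytr)

train_acc = pipe.score(xtr, ytr)
test_acc = pipe.score(xte, yte)
print("training accuracy is:", train_acc)
print("testing accuracy is:", test_acc)
```

For this dataset the difference is likely small, since the symptom columns have no missing values, but the pipeline keeps preprocessing and model in one object that can be reused for cross-validation.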


Model 2 - Naive Bayes

This code trains a Naive Bayes model using the GaussianNB class from the scikit-learn library. The fit method is called on the model object with the training data xtrain and the corresponding labels ytrain. This estimates the parameters of the conditional probability distribution of each feature given each class label; for GaussianNB these are a per-class mean and variance for every feature.

In [58]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(xtrain,ytrain)

In [59]:
y_pred = gnb.predict(xtest)
from sklearn import metrics
print("Gaussian Naive Bayes model accuracy(in %):", metrics.accuracy_score(ytest, y_pred)*100)

Gaussian Naive Bayes model accuracy(in %): 100.0


Model 3 - SVM

This code uses the Support Vector Machine (SVM) algorithm to create a model for classification.

The first two lines import the SVC class and the metrics module from scikit-learn.

Then, an SVM model is instantiated with parameters C=0.1, kernel='linear', and gamma=1. The C parameter is the regularization parameter: it controls the trade-off between a wide margin and a low number of misclassified training points, with smaller values giving stronger regularization. The kernel parameter specifies the type of kernel function used in the algorithm. The gamma parameter is the kernel coefficient for the 'rbf', 'poly', and 'sigmoid' kernels; it has no effect with kernel='linear', so the value passed here is ignored.

Next, the model is trained using the training set (xtrain and ytrain) by calling the fit() method on the SVM model object.

After that, the model is used to make predictions on the test set (xtest) by calling the predict() method on the SVM model object, and the predicted labels are stored in the prediction variable.

Finally, the accuracy of the SVM model is calculated by comparing the predicted labels with the actual labels of the test set (ytest) using the accuracy_score() function from the metrics module. The accuracy is printed as a percentage using the print() function.

In [61]:
from sklearn.svm import SVC
from sklearn import metrics

svc_model = SVC(C=0.1, kernel='linear', gamma=1)
svc_model.fit(xtrain, ytrain)
 
prediction = svc_model.predict(xtest)
print("SVM model accuracy (in %):", metrics.accuracy_score(ytest, prediction) * 100)


SVM model accuracy (in %): 100.0
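
The role of gamma with a linear kernel can be checked directly. The sketch below uses a tiny made-up dataset (`X_demo` and `y_demo` are hypothetical) and fits two linear-kernel SVC models that differ only in gamma, then compares their decision values.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical binary symptom vectors for three made-up classes
X_demo = np.array([
    [1, 0, 0], [1, 1, 0],
    [0, 1, 0], [0, 1, 1],
    [0, 0, 1], [1, 0, 1],
], dtype=float)
y_demo = np.array(["Acne", "Acne", "Allergy", "Allergy",
                   "Arthritis", "Arthritis"])

# Same C and kernel, different gamma
svc_a = SVC(C=0.1, kernel="linear", gamma=1).fit(X_demo, y_demo)
svc_b = SVC(C=0.1, kernel="linear", gamma=100).fit(X_demo, y_demo)

# With a linear kernel the decision values do not depend on gamma
same_decisions = np.allclose(
    svc_a.decision_function(X_demo), svc_b.decision_function(X_demo)
)
same_predictions = (svc_a.predict(X_demo) == svc_b.predict(X_demo)).all()
print(same_decisions, same_predictions)
```

Tuning gamma only becomes meaningful after switching to a kernel that uses it, such as kernel='rbf'.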


This will print a classification report containing precision, recall, F1-score, and support for each class in the target variable. The report also includes macro and weighted averages of these metrics. The report can be useful for evaluating the overall performance of a classification model, as well as identifying specific areas for improvement.

In [63]:
from sklearn.metrics import classification_report

y_pred = svc_model.predict(xtest)
print(classification_report(ytest, y_pred))


                                         precision    recall  f1-score   support

(vertigo) Paroymsal  Positional Vertigo       1.00      1.00      1.00        18
                                   AIDS       1.00      1.00      1.00        30
                                   Acne       1.00      1.00      1.00        24
                    Alcoholic hepatitis       1.00      1.00      1.00        25
                                Allergy       1.00      1.00      1.00        24
                              Arthritis       1.00      1.00      1.00        23
                       Bronchial Asthma       1.00      1.00      1.00        33
                   Cervical spondylosis       1.00      1.00      1.00        23
                            Chicken pox       1.00      1.00      1.00        21
                    Chronic cholestasis       1.00      1.00      1.00        15
                            Common Cold       1.00      1.00      1.00        23
                           