Problem statement:
Ayurveda, regarded as one of the oldest medical systems known to mankind, treats the person mainly with herbs. 
This traditional science holds that diseases arise from an imbalance in the body humors called Doshas. 
When a Dosha increases, it causes disease. 
To treat the disease, Ayurveda uses various herbs (and some minerals too). 
These herbs act on the Doshas through:

Rasa (taste): The taste of a drug influences the Doshas. 
    Ex. Sweet taste (Madhura Rasa) decreases Vata and Pitta.
    
Guna (properties): Certain properties, such as oiliness and heaviness, increase Kapha.

Virya (potency): It can be Ushna (hot) or Sheeta (cold), and rarely Anushna (neither hot nor cold); each has its own impact on the Doshas.

Vipaka (final transformation): It can be Madhura (sweet), Amla (sour) or Katu (pungent, hot taste).

Dosha: The overall outcome of a drug is recorded as one of seven codes: VS (Vata shamaka, decreases Vata), PS (Pitta shamaka, decreases Pitta), KS (Kapha shamaka, decreases Kapha), TS (Tridosha shamaka, decreases Vata, Pitta and Kapha), VPS (Vata Pitta shamaka, decreases Vata and Pitta), VKS (Vata Kapha shamaka, decreases Vata and Kapha), PKS (Pitta Kapha shamaka, decreases Pitta and Kapha).

So, in the given dataset, Dosha is the target variable, i.e. we try to understand which of the above components (Rasa, Guna, Virya, Vipaka) influence the Doshas.
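The seven outcome codes described above can be written down as a small lookup table. This is only an illustrative sketch: the names `DOSHA_CODES` and `pacified_doshas` are hypothetical and are not used elsewhere in the notebook.

```python
# Outcome codes from the problem statement mapped to the Doshas they decrease.
# DOSHA_CODES and pacified_doshas are illustrative names, not part of the dataset.
DOSHA_CODES = {
    "VS": {"Vata"},
    "PS": {"Pitta"},
    "KS": {"Kapha"},
    "VPS": {"Vata", "Pitta"},
    "VKS": {"Vata", "Kapha"},
    "PKS": {"Pitta", "Kapha"},
    "TS": {"Vata", "Pitta", "Kapha"},
}


def pacified_doshas(code):
    """Return the sorted list of Doshas decreased by a drug with this code."""
    return sorted(DOSHA_CODES[code.strip().upper()])


print(pacified_doshas("VPS"))  # ['Pitta', 'Vata']
print(pacified_doshas("TS"))   # ['Kapha', 'Pitta', 'Vata']
```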


### Step1 : Reading and understanding the data  

In [None]:
# importing the required libraries

import numpy as np, pandas as pd
import matplotlib.pyplot as plt, seaborn as sns
from IPython.display import display, HTML
display(HTML("<style>.container {width:100% !important;}</style>"))
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

In [None]:
# Reading the dataset and storing it in a new dataframe 'data'

data = pd.read_csv('../input/dosha-prediction/dravya_data1.csv')

In [None]:
data.head()

In [None]:
data.shape

In [None]:
data.value_counts()

### Step2 : Data Cleaning

In [None]:
data.info()

In [None]:
# There are null values (in Rasa2 and Guna2), but it is not advisable to replace them.

In [None]:
# checking the distinct values in each column and how they are distributed

In [None]:
data['Guna1'].value_counts()

In [None]:
## some category labels appear more than once because of stray whitespace (e.g. laghu, snigdha, tikshna), so we strip them to merge the duplicates

In [None]:
data['Guna1'] = data['Guna1'].str.strip()
data['Guna1'].value_counts()

In [None]:
# the number of distinct Guna1 values is brought down from 13 to 8
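Why stripping reduces the number of categories can be seen on a toy series. The values below are made up for illustration (only the label names come from the note above); the real column has more entries.

```python
import pandas as pd

# Made-up labels with stray leading/trailing spaces, mimicking the Guna1 issue.
guna = pd.Series(["laghu", "laghu ", " snigdha", "snigdha", "tikshna", "tikshna "])

print(guna.nunique())     # 6: each spacing variant counts as its own category

cleaned = guna.str.strip()
print(cleaned.nunique())  # 3: variants collapse once whitespace is removed
print(cleaned.value_counts().to_dict())
```

If labels also differ in letter case, `.str.lower()` could be chained in the same way, though whether that is needed depends on the actual data.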

In [None]:
data['Guna2'].value_counts()

In [None]:
data['Guna2'] = data['Guna2'].str.strip()
data['Guna2'].value_counts()

In [None]:
data['Rasa1'].value_counts()

In [None]:
data['Rasa1'] = data['Rasa1'].str.strip()
data['Rasa1'].value_counts()

In [None]:
data['Rasa2'].value_counts()

In [None]:
data['Vipaka'].value_counts()

In [None]:
data['Vipaka'] = data['Vipaka'].str.strip()
data['Vipaka'].value_counts()

In [None]:
data['Virya'].value_counts()

In [None]:
data['Virya'] = data['Virya'].str.strip()
data['Virya'].value_counts()

In [None]:
data['Dosha'].value_counts()

In [None]:
data.Dosha.value_counts(normalize=True)

### Step3 : Data Preparation

In [None]:
## Assessing the categorical variables

In [None]:
cat_cols = data.select_dtypes("object").columns
cat_cols

In [None]:
plt.figure(figsize=[20,7])
for ind, col in enumerate(cat_cols):
    plt.subplot(2,5,ind+1)
    data[col].value_counts(normalize=True).plot.barh()
    plt.title(col)
plt.show()

In [None]:
# `col` still holds the last categorical column from the loop above
data[col].value_counts(normalize=True)

#### Creating dummy variables for the categorical variables

In [None]:
dumm_cols = ['Guna1', 'Guna2', 'Rasa1', 'Rasa2', 'Vipaka', 'Virya']

In [None]:
dravya_dummies = pd.get_dummies(data[dumm_cols], drop_first=True)

In [None]:
dravya_dummies.head()

In [None]:
dravya_dummies.shape
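Because the nulls in Rasa2 and Guna2 are left in place, it helps to know how `pd.get_dummies` treats them. The sketch below uses a made-up four-row frame: a missing value becomes a row of zeros, which with `drop_first=True` looks the same as the dropped first category. Passing `dummy_na=True` is one possible way to keep them apart; it is shown here as an option, not as what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (values are illustrative only).
toy = pd.DataFrame({"Rasa2": ["katu", np.nan, "tikta", "kashaya"]})

# Same call pattern as in the notebook: first category (kashaya) is dropped.
dummies = pd.get_dummies(toy, drop_first=True)
print(list(dummies.columns))         # ['Rasa2_katu', 'Rasa2_tikta']
print(dummies.sum(axis=1).tolist())  # [1, 0, 1, 0]: NaN and kashaya both give 0

# Optional variant: add an explicit indicator column for missing values.
with_na = pd.get_dummies(toy, drop_first=True, dummy_na=True)
print(with_na.sum(axis=1).tolist())  # [1, 1, 1, 0]: NaN is now distinguishable
```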

In [None]:
## Preparing the final data by concatenating the original data with the dummy data set 

In [None]:
final_data = pd.concat([data, dravya_dummies], axis=1)

In [None]:
final_data.shape

In [None]:
final_data.head()

In [None]:
final_data = final_data.drop(dumm_cols,axis=1)

In [None]:
final_data.describe()

---

### Step4 : Model building

#### Dividing into train and test datasets

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
df_train, df_test = train_test_split(final_data, test_size=0.2, random_state=42) 

In [None]:
df_train.shape, df_test.shape
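Since the Dosha classes are not equally frequent, a stratified split is one option for keeping the class proportions similar in train and test. The sketch below uses a made-up label column; on the real data, `stratify` raises an error if any class has only a single member, so it may not be applicable as-is.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up imbalanced target: 60% KS, 30% VS, 10% TS.
toy = pd.DataFrame({
    "x": range(100),
    "Dosha": ["KS"] * 60 + ["VS"] * 30 + ["TS"] * 10,
})

toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["Dosha"]
)

# The 20 test rows keep the 60/30/10 proportions of the full data.
print(toy_test["Dosha"].value_counts().to_dict())
```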

## Building predictive models

In [None]:
X_train = df_train.drop(['Dosha','Sanskrit','Sl No '], axis=1)
y_train = df_train['Dosha']
X_test = df_test.drop(['Dosha','Sanskrit','Sl No '], axis=1)
y_test = df_test['Dosha']

In [None]:
X_train.shape

In [None]:
y_train.shape, y_test.shape

In [None]:
X_train.shape

In [None]:
X_train.columns

## Creating the decision tree graph

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
dt = DecisionTreeClassifier(max_depth=3)
dt.fit(X_train, y_train)

In [None]:
import sklearn

In [None]:
# class_names must follow the sorted label order the model stores in dt.classes_
print(sklearn.tree.export_graphviz(dt, 
                 filled=True, rounded=True,
                 special_characters=True, feature_names = X_train.columns,
                 class_names=list(dt.classes_))) 

In [None]:
!pip install pydotplus

In [None]:
# Importing required packages for visualization
from IPython.display import Image  
from sklearn.tree import export_graphviz
import pydotplus, graphviz
from io import StringIO 

In [None]:
# plotting the tree fitted above (max_depth=3)
dot_data = StringIO()  

# class_names must follow the sorted label order the model stores in dt.classes_
export_graphviz(dt, out_file=dot_data, filled=True, rounded=True,
                feature_names=X_train.columns, 
                class_names=list(dt.classes_))

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_pdf("Dosha_Dravyas.pdf")  # save a copy first
# only the last expression in a cell is displayed, so the image comes last
Image(graph.create_png(), width=800, height=900)

In [None]:
# class_names must follow the sorted label order the model stores in dt.classes_
print(export_graphviz(dt, filled=True, rounded=True, special_characters=True,
               feature_names=X_train.columns, 
                 class_names=list(dt.classes_)))

In [None]:
print(dot_data.getvalue())

Evaluating model performance

In [None]:
y_train_pred = dt.predict(X_train)
y_test_pred = dt.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score

In [None]:
print(accuracy_score(y_train, y_train_pred))
confusion_matrix(y_train, y_train_pred)

In [None]:
print(accuracy_score(y_test, y_test_pred))
confusion_matrix(y_test, y_test_pred)

In [None]:
## The gap between the train (72%) and test (68%) accuracy scores is small, so there is little sign of overfitting.

Creating helper functions to evaluate model performance and help plot the decision tree

In [None]:
def get_dt_graph(dt_classifier):
    """Return a pydotplus graph of a fitted tree, labelled with the Dosha codes."""
    dot_data = StringIO()
    # class_names must follow the sorted label order stored in classes_
    export_graphviz(dt_classifier, out_file=dot_data, filled=True,rounded=True,
                    feature_names=X_train.columns, 
                    class_names=list(dt_classifier.classes_))
    graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
    return graph
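scikit-learn keeps string class labels in sorted order, and the `class_names` argument of `export_graphviz` is read in that same order. A quick check on toy data (the feature values and labels below are invented) shows the order the model actually uses:

```python
from sklearn.tree import DecisionTreeClassifier

# Invented toy data; only the label codes come from the problem statement.
X_toy = [[0, 0], [0, 1], [1, 0], [1, 1], [0, 0], [1, 1]]
y_toy = ["VS", "PS", "KS", "TS", "VS", "TS"]

clf = DecisionTreeClassifier(random_state=42).fit(X_toy, y_toy)

# Labels are stored alphabetically, not in the order they were first seen.
print(list(clf.classes_))  # ['KS', 'PS', 'TS', 'VS']
```

A hand-written list in a different order would attach the wrong Dosha code to the leaves of the drawn tree, even though the predictions themselves stay correct.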

In [None]:
def evaluate_model(dt_classifier):
    print("Train Accuracy :", accuracy_score(y_train, dt_classifier.predict(X_train)))
    print("Train Confusion Matrix:")
    print(confusion_matrix(y_train, dt_classifier.predict(X_train)))
    print("-"*50)
    print("Test Accuracy :", accuracy_score(y_test, dt_classifier.predict(X_test)))
    print("Test Confusion Matrix:")
    print(confusion_matrix(y_test, dt_classifier.predict(X_test)))

Without setting any hyper-parameters

In [None]:
dt_default = DecisionTreeClassifier(random_state=42)
dt_default.fit(X_train, y_train)

In [None]:
gph = get_dt_graph(dt_default)
Image(gph.create_png())

In [None]:
evaluate_model(dt_default)

Controlling the depth of the tree

In [None]:
?DecisionTreeClassifier

In [None]:
dt_depth = DecisionTreeClassifier(max_depth=3)
dt_depth.fit(X_train, y_train)

In [None]:
gph = get_dt_graph(dt_depth) 
Image(gph.create_png())

In [None]:
# let's check with a depth of 4
dt_depth = DecisionTreeClassifier(max_depth=4)
dt_depth.fit(X_train, y_train)
gph = get_dt_graph(dt_depth) 
Image(gph.create_png())

In [None]:
evaluate_model(dt_depth)

Specifying minimum samples before split

In [None]:
dt_min_split = DecisionTreeClassifier(min_samples_split=50)
dt_min_split.fit(X_train, y_train)

In [None]:
gph = get_dt_graph(dt_min_split) 
Image(gph.create_png())

In [None]:
evaluate_model(dt_min_split)

Specifying minimum samples in leaf node

In [None]:
dt_min_leaf = DecisionTreeClassifier(min_samples_leaf=25, random_state=42)
dt_min_leaf.fit(X_train, y_train)

In [None]:
gph = get_dt_graph(dt_min_leaf)
Image(gph.create_png())

In [None]:
evaluate_model(dt_min_leaf)

Using Entropy instead of Gini

In [None]:
dt_min_leaf_entropy = DecisionTreeClassifier(min_samples_leaf=25, random_state=42, criterion="entropy")
dt_min_leaf_entropy.fit(X_train, y_train)

In [None]:
gph = get_dt_graph(dt_min_leaf_entropy)
Image(gph.create_png())

In [None]:
evaluate_model(dt_min_leaf_entropy)

Hyper-parameter tuning

In [None]:
dt = DecisionTreeClassifier(random_state=42)

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
# Create the parameter grid to search over
params = {
    'max_depth': [2, 3, 5, 10],
    'min_samples_leaf': [5, 10, 20],
    'criterion': ["gini", "entropy"]
}

In [None]:
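# Plain "f1" is a binary-only scorer; Dosha has seven classes, so an averaged variant is used.
# This search object is superseded by the accuracy-based one in the next cell.
grid_search = GridSearchCV(estimator=dt, 
                           param_grid=params, 
                           cv=4, n_jobs=-1, verbose=1, scoring="f1_macro")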

In [None]:
# Instantiate the grid search model
grid_search = GridSearchCV(estimator=dt, 
                           param_grid=params, 
                           cv=4, n_jobs=-1, verbose=1, scoring = "accuracy")
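For a multiclass target such as Dosha, F1 has to be averaged over the classes; the default binary F1 is rejected by scikit-learn. The sketch below, on made-up labels, shows that macro F1 is simply the unweighted mean of the per-class F1 scores. Whether macro averaging or accuracy is the better selection metric here is a judgment call.

```python
from sklearn.metrics import f1_score

# Made-up true and predicted labels for three Dosha codes.
y_true = ["VS", "VS", "PS", "PS", "KS", "KS"]
y_pred = ["VS", "PS", "PS", "PS", "KS", "VS"]

# The default average="binary" does not apply to more than two classes.
try:
    f1_score(y_true, y_pred)
except ValueError as err:
    print("binary F1 rejected:", type(err).__name__)

# Per-class F1 in a fixed label order, and their unweighted (macro) mean.
per_class = f1_score(y_true, y_pred, average=None, labels=["KS", "PS", "VS"])
macro = f1_score(y_true, y_pred, average="macro")

print(per_class.round(3))
print(round(macro, 3), round(per_class.mean(), 3))  # the two values agree
```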

In [None]:
%%time
grid_search.fit(X_train, y_train)

In [None]:
score_df = pd.DataFrame(grid_search.cv_results_)
score_df.head()

In [None]:
score_df.nlargest(5,"mean_test_score")

In [None]:
grid_search.best_estimator_

In [None]:
dt_best = grid_search.best_estimator_

In [None]:
evaluate_model(dt_best)

In [None]:
from sklearn.metrics import classification_report

In [None]:
y_pred = dt_best.predict(X_train)

In [None]:
res = df_train.copy()  # copy so the prediction column is not added to df_train itself
res['pred'] = y_pred

In [None]:
res.shape

In [None]:
res[res.pred.isnull()]

In [None]:
res[['Sanskrit','Dosha','pred']].head(10)

In [None]:
print(classification_report(y_test, dt_best.predict(X_test)))

In [None]:
gph = get_dt_graph(dt_best)
Image(gph.create_png(), width=2000, height=2500)

## Image(gph.create_png(), width=900, height=1000)

In [None]:
## extracting some of the drugs that satisfy the split criteria along one branch of the tree above

In [None]:
A = X_train[X_train['Virya_ushna'] < 0.5]

In [None]:
B= A[A['Rasa1_madhura']< 0.5]
C= B[B['Vipaka_katu']< 0.5]
D = C[C['Rasa2_kashaya']> 0.5]
D
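The chain of intermediate frames above can also be written as one combined boolean mask. The sketch below uses a made-up four-row frame with the same column names; it is only meant to show the pattern, not to reproduce the notebook's result.

```python
import pandas as pd

# Made-up dummy-encoded rows using the column names from the cell above.
toy = pd.DataFrame({
    "Virya_ushna":   [0, 0, 1, 0],
    "Rasa1_madhura": [0, 1, 0, 0],
    "Vipaka_katu":   [0, 0, 0, 1],
    "Rasa2_kashaya": [1, 1, 1, 1],
})

# Each condition mirrors one split threshold on the path through the tree.
mask = (
    (toy["Virya_ushna"] < 0.5)
    & (toy["Rasa1_madhura"] < 0.5)
    & (toy["Vipaka_katu"] < 0.5)
    & (toy["Rasa2_kashaya"] > 0.5)
)
selected = toy[mask]
print(selected.index.tolist())  # [0]: only the first row passes all four tests
```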

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rf = RandomForestClassifier(random_state=42, n_estimators=10, max_depth=3)

In [None]:
rf.fit(X_train, y_train)

In [None]:
rf.estimators_[0]

In [None]:
sample_tree = rf.estimators_[4]

In [None]:
gph = get_dt_graph(sample_tree)
Image(gph.create_png(), width=900, height=1000)

In [None]:
gph = get_dt_graph(rf.estimators_[2])
Image(gph.create_png(), width=900, height=900)

OOB score

In [None]:
rf = RandomForestClassifier(random_state=42, n_estimators=10, max_depth=3, oob_score=True)

In [None]:
rf.fit(X_train, y_train)

In [None]:
rf.oob_score_
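The OOB score is the accuracy measured on the rows each tree did not see in its bootstrap sample, so it gives a validation-style estimate without a separate hold-out set. A minimal sketch on synthetic data (generated with `make_classification`, not the Dosha data) follows; it uses 100 trees because with very few trees some rows may never be out of bag.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic three-class data, standing in for the real features.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=42
)

forest = RandomForestClassifier(
    n_estimators=100, max_depth=3, oob_score=True, random_state=42
)
forest.fit(X_syn, y_syn)

# OOB accuracy next to the (usually more optimistic) training accuracy.
print("OOB accuracy  :", round(forest.oob_score_, 3))
print("Train accuracy:", round(forest.score(X_syn, y_syn), 3))
```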

In [None]:
evaluate_model(rf)

In [None]:
# this is relatively better as train and test accuracy are very close

Grid search for hyper-parameter tuning

In [None]:
classifier_rf = RandomForestClassifier(random_state=42, n_jobs=-1)

In [None]:
# Create the parameter grid to search over
params = {
    'max_depth': [1, 2, 5],
    'min_samples_leaf': [5, 10, 20],
    'max_features': [2,3,4],
    'n_estimators': [10, 30, 50, 100, 200]
}

In [None]:
# Instantiate the grid search model
grid_search = GridSearchCV(estimator=classifier_rf, param_grid=params, 
                          cv=4, n_jobs=-1, verbose=1, scoring = "accuracy")

In [None]:
%%time
grid_search.fit(X_train,y_train)

In [None]:
rf_best = grid_search.best_estimator_

In [None]:
rf_best

In [None]:
evaluate_model(rf_best)

In [None]:
sample_tree = rf_best.estimators_[0]

In [None]:
gph = get_dt_graph(sample_tree)
Image(gph.create_png())

In [None]:
gph = get_dt_graph(rf_best.estimators_[0])
Image(gph.create_png(), height=900, width=900)

In [None]:
gph = get_dt_graph(rf_best.estimators_[7])
Image(gph.create_png(), height=900, width=900)

Variable importance in RandomForest and Decision trees

In [None]:
classifier_rf = RandomForestClassifier(random_state=42, n_jobs=-1, max_depth=5, n_estimators=100, oob_score=True)

In [None]:
classifier_rf.fit(X_train, y_train)

In [None]:
classifier_rf.feature_importances_

In [None]:
imp_df = pd.DataFrame({
    "Varname": X_train.columns,
    "Imp": classifier_rf.feature_importances_
})

In [None]:
imp_df.sort_values(by="Imp", ascending=False)
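Because each attribute is spread over several dummy columns, the importances can be summed back to the parent attribute (Virya, Rasa1, Guna1, ...) by grouping on the prefix before the underscore. The numbers below are placeholders chosen for illustration, not the notebook's output, and the column names are assumed to follow the `get_dummies` naming pattern.

```python
import pandas as pd

# Placeholder importances; real values come from classifier_rf.feature_importances_.
imp_toy = pd.DataFrame({
    "Varname": ["Virya_ushna", "Virya_sheeta", "Rasa1_madhura", "Guna1_sara"],
    "Imp": [0.17, 0.17, 0.12, 0.01],
})

# Parent attribute = text before the first underscore of the dummy column name.
imp_toy["Attribute"] = imp_toy["Varname"].str.split("_", n=1).str[0]

by_attr = imp_toy.groupby("Attribute")["Imp"].sum().sort_values(ascending=False)
print(by_attr.round(2).to_dict())
```

Applied to `imp_df`, the same two lines would give one importance figure per attribute rather than per dummy column.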

In [None]:
## As per the above observation, Virya plays the most important role (17+17=34%) in deciding the doshic action, followed by Madhura Rasa

Comments: 
1. The same has been documented in the classical textbooks of Ayurveda. Virya is considered "Utkrishtashaktisampannaguna", i.e. the highly potent quality of any drug. 

2. The model gives the least importance to 'Sara Guna', and in typical Ayurvedic practice physicians likewise give the least importance to 'Sara Guna' while choosing a drug for Dosha reduction.