### Assignment 4. Mental Health Disorder Classification 

This assignment is to develop a decision-tree-based model (a random forest) that predicts a mental health disorder classification from a collection of symptoms. The objective of this study is to determine whether we can develop a quantitative analysis that predicts mental health diagnosis, and whether it yields useful results that can inform practice (without having to repeat all this analysis).
 
This data set is mostly categorical variables, e.g., mood swings (yes/no). For categorical variables, decision trees can make more intuitive sense, since the split boundaries separate the different categories.


In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV  # this is a new method
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.inspection import permutation_importance



- Load Data

In [None]:
df = pd.read_csv('cleaned_mental_disorder_data.csv')
df.info()

Every single variable in this data set is categorical! Let's take a quick look.

In [None]:
# display the first few rows
print("First few rows of the dataset:")
df.head(5)

Use the scroll bar to look at all the variables.  
We can see that everything has been encoded as text strings, even potentially numeric variables like Sexual Activity, Concentration, and Optimism.

Because I obsess about these things, I am going to fix three misspelled or awkward column headers.

In [None]:
df = df.rename(columns={'Anorxia': 'Anorexia', 'Expert Diagnose':'Expert Diagnosis', 'Sleep dissorder':'Sleep disorder'})


1. Are there any missing values in the data set? If there are, remove those observations from the data.
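One possible way to approach this check is sketched below on a small made-up frame. The name `toy` and its contents are hypothetical and are not the assignment data; `isna` and `dropna` are standard pandas methods.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data
toy = pd.DataFrame({
    "Mood Swing": ["YES", "NO", np.nan, "YES"],
    "Sadness": ["Usually", "Seldom", "Sometimes", None],
})

# Count the missing entries in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop every row (observation) that has at least one missing value
toy_clean = toy.dropna().reset_index(drop=True)
print(toy_clean.shape)
```

The same two calls should work on `df`. If every count is zero, there is nothing to drop.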

Because this is categorical data, it is useful to know the possible values of each variable. We can use the `unique` method to find them.

In [None]:
# print the unique values found in each column
for column in df.columns:
    unique_values = df[column].unique()
    print(f"\nUnique values in '{column}':\n{unique_values}")

- There is, of course, a brute-force approach to encoding this data numerically that could work, but that's throwing the baby out with the bathwater.

- There are 4 types of variables. 
    - 'Expert Diagnosis' has 4 categories that bear no particular ordering relationship to one another. Since this is the target variable for classification, we can leave them as is, which also makes most of the tables and plots easier to produce. DO NOT ENCODE!
    - 7 variables have 'YES' and 'NO' as the two possibilities.  It would be ideal to encode 'YES' as 1 and 'NO' as 0 
    - 3 variables are encoded as '3 from 10', etc.  This indicates that on a scale of 10 a numeric score of 3 was given.  These should be ideally converted to numeric values 
    - 4 variables have descriptive values, ['Seldom', 'Sometimes', 'Usually', 'Most-Often']. These are actually ordered values, with 'Seldom' as the least and 'Most-Often' as the most. They should be numerically encoded to reflect their relative strength.

Since we want a careful encoding, `pd.get_dummies` will not work that well. We should instead be specific about the encoding. Here is an example.

In [None]:
# ordinal encoding for the four frequency variables (the first four columns)

encoder = {'Seldom':1, 'Sometimes':2, 'Usually':3, 'Most-Often':4}
for j in df.columns[0:4]:
    df[j] = df[j].map(encoder)



2. Using similar logic, encode the other predictor variables as numeric quantities.  
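As a hedged illustration of the "similar logic", the sketch below encodes one YES/NO column and one 'k from 10' column on a made-up frame. The frame `toy`, its column choices, and its values are assumptions for illustration only; check them against the unique values printed above.

```python
import pandas as pd

# Hypothetical toy frame; column names and values are illustrative only
toy = pd.DataFrame({
    "Mood Swing": ["YES", "NO", "YES"],
    "Concentration": ["3 from 10", "7 from 10", "10 from 10"],
})

# Binary variables: map 'YES' -> 1 and 'NO' -> 0
yes_no = {"YES": 1, "NO": 0}
toy["Mood Swing"] = toy["Mood Swing"].map(yes_no)

# 'k from 10' scores: keep the leading number and convert it to an integer
toy["Concentration"] = toy["Concentration"].str.split(" ").str[0].astype(int)

print(toy)
```

Note that `map` returns NaN for any value missing from the dictionary. If the unique values reveal stray whitespace or inconsistent capitalization, clean the strings (for example with `.str.strip()`) before mapping.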

Exploratory Data Analysis - You need to do this to understand your results! 
Since these are all categorical variables, it makes the most sense to use histograms or count plots. There are a lot of plots to look at. When you make that many, it's very useful to come up with a systematic way to label things, using colors to help your reader.

3. Make a plot using `sns.countplot` that shows the number of patients with each diagnosis. When you make this plot, choose the `palette` argument carefully, because it assigns each diagnosis a different color, and we want to keep that consistent throughout the notebook. I control this by setting `hue='Expert Diagnosis'` and `palette='bright'`, but there are other popular palettes. You can suppress the legend using `legend=False`. Also take control of the order in which the diagnoses are presented so that it makes sense, by providing `order` as a list of the diagnoses.

Now, plot all the predictor variables as histograms. My call, inside a loop indexed by `i`, looked like this:

`sns.histplot(data=df, x=df.columns[i], hue='Expert Diagnosis', hue_order = ['Depression','Bipolar Type-1','Bipolar Type-2','Normal'],multiple="dodge", kde=True, palette="bright",legend=False)`

You don't need a legend here, because the previous plot assigned a color to each diagnosis and I am following the same color scheme here.

4. Split the data into training and test sets. I put 30% into my test set. Separate the target 'Expert Diagnosis' as the `y` variable and all the predictors as `X`.
- **It's very important to stratify correctly.** That is, we want the test data to have the same balance of diagnoses as the training data.
- If you called the target variable `y`, then include the parameter `stratify=y` when you split the data.
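The effect of stratification can be seen in a small sketch on made-up data. The frame `toy`, its class counts, and `random_state=42` are assumptions chosen only to make the example reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 40 patients with an imbalanced target
toy = pd.DataFrame({
    "Mood Swing": [1, 0] * 20,
    "Sadness": [1, 2, 3, 4] * 10,
    "Expert Diagnosis": ["Depression"] * 20
    + ["Bipolar Type-1"] * 10
    + ["Normal"] * 10,
})

X = toy.drop(columns="Expert Diagnosis")
y = toy["Expert Diagnosis"]

# 30% test set; stratify=y keeps the class proportions equal in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```

Comparing the two printed proportion tables is a quick way to confirm that the split is balanced the same way in both sets.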

5. Set up a random forest classifier. We need to use `GridSearchCV` to optimize two hyperparameters: `n_estimators`, the number of trees, and `max_depth`, the maximum depth of each tree. I used `n_estimators` ranging from 100 to 400 and `max_depth` ranging from 1 to 5. For scoring, let's use accuracy. I usually do 5-fold cross-validation by default. Fit the classifier to the training data.

Report 
    - best `max_depth`
    - best `n_estimators`
    - best cross-validated accuracy

Extract the best rf model from your fit.   
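The grid search can be sketched as follows on synthetic data. The data from `make_classification`, the `random_state` values, and the deliberately small grid are assumptions made so the example runs quickly; the ranges described above would go in the same `param_grid` dictionary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the encoded training data (4 classes, made up)
X_train, y_train = make_classification(
    n_samples=200, n_features=8, n_informative=5,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# Small grid so the sketch runs quickly; substitute the assignment's ranges
param_grid = {"n_estimators": [50, 100], "max_depth": [1, 3, 5]}

grid = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="accuracy",
    cv=5,  # 5-fold cross-validation
)
grid.fit(X_train, y_train)

print("best max_depth:", grid.best_params_["max_depth"])
print("best n_estimators:", grid.best_params_["n_estimators"])
print("best CV accuracy:", round(grid.best_score_, 3))

# The model refit on all training data with the best hyperparameters
best_rf = grid.best_estimator_
```

With the default `refit=True`, `best_estimator_` has already been refit on the full training set, so it can be used directly for prediction.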

6. Evaluate your classifier by predicting the test data.

Present 

- the accuracy score 
- classification report.  
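A hedged sketch of this evaluation step on synthetic data is shown below. The data, the fixed hyperparameters, and the name `best_rf` are stand-ins for the model extracted from your own grid search. `accuracy_score` is not among the imports at the top of the notebook, and `best_rf.score(X_test, y_test)` gives the same number.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded data (4 classes, made up)
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Stand-in for the best model extracted from the grid search
best_rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
best_rf.fit(X_train, y_train)

# Predict the held-out test data and summarize performance
y_pred = best_rf.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print("test accuracy:", round(test_accuracy, 3))

# Per-class precision, recall, and F1
print(classification_report(y_test, y_pred))
```

In the report, precision for a class is the fraction of predictions of that class that were correct, and recall is the fraction of true members of that class that the model found.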

In the markdown below, write down what the precision and recall values are telling you about when this model performs well.

Comment on accuracy, precision, and recall. 



7. Make a confusion matrix to look at the pattern of misclassification. When you do this, you will see that it labels the diagnoses as 0, 1, 2, 3. This is because, internally, the random forest mapped the conditions onto numeric values. To find the mapping, look at the `.classes_` attribute of your best random forest object. In my case, it looked like this:

['Bipolar Type-1' 'Bipolar Type-2' 'Depression' 'Normal']

So that's the order of the conditions, and you should be able to fix your labels.
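One way to attach the labels, sketched on made-up data, is to pass `classes_` to `confusion_matrix` and wrap the result in a labeled DataFrame. The data generation and the names `best_rf` and `cm_table` are hypothetical; a heatmap would work equally well for display.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
names = np.array(["Bipolar Type-1", "Bipolar Type-2", "Depression", "Normal"])

# Hypothetical data: the first feature is a noisy copy of the class index
codes = rng.integers(0, 4, size=200)
X = np.column_stack([codes + rng.normal(0, 0.5, 200), rng.normal(0, 1, 200)])
y = names[codes]

# First 150 rows for training, last 50 for testing
best_rf = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0)
best_rf.fit(X[:150], y[:150])
y_pred = best_rf.predict(X[150:])

# classes_ holds the label order the model uses (sorted for string labels)
labels = best_rf.classes_
cm = confusion_matrix(y[150:], y_pred, labels=labels)

# Rows are the true diagnoses, columns are the predicted diagnoses
cm_table = pd.DataFrame(
    cm,
    index=pd.Index(labels, name="True"),
    columns=pd.Index(labels, name="Predicted"),
)
print(cm_table)
```

Passing `labels=` explicitly guarantees that the row and column order of the matrix matches the names used to label it.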


8. To evaluate which predictors were useful for the classifier, use `permutation_importance` rather than partial dependence. When we have more than 2 classes, partial dependence becomes harder to interpret, because the model considers classifying one class versus the rest. Permutation importance tells us how much accuracy will decline if we randomly shuffle the values of a variable.
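A minimal sketch of permutation importance on made-up data follows. The predictors `informative` and `noise`, the two-class target, and the model settings are assumptions for illustration; in the assignment you would pass your best model together with the test data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 300

# Hypothetical predictors: 'informative' drives the label, 'noise' does not
X = pd.DataFrame({
    "informative": rng.integers(0, 2, size=n),
    "noise": rng.integers(1, 5, size=n),
})
y = np.where(X["informative"] == 1, "Depression", "Normal")

rf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0)
rf.fit(X, y)

# Shuffle each column 10 times and record the mean drop in accuracy
result = permutation_importance(
    rf, X, y, scoring="accuracy", n_repeats=10, random_state=0
)
importances = pd.Series(
    result.importances_mean, index=X.columns
).sort_values(ascending=False)
print(importances)
```

A value near zero means that shuffling the variable barely changes accuracy, so the model is not relying on it. `result.importances_std` gives the spread across repeats if you want error bars on a bar plot.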

Write 1-2 paragraphs that summarize:

1. What the data are and what question is being asked of the data.
2. What model you made and how well it performed. What kinds of errors, if any, does it make?
3. What features of the data were most informative of mental health diagnosis.
4. Can you provide a simple explanation of how to make a preliminary diagnosis that could be provided to counselors and first responders? 
5. If you had limited space and could only show a few of these figures, which ones would you show? 