# Beck's Depression Inventory 

The Beck Depression Inventory (BDI) is a 21-item, self-rated scale that evaluates key symptoms of depression including mood, pessimism, sense of failure, self-dissatisfaction, guilt, punishment, self-dislike, self-accusation, suicidal ideas, crying, irritability, social withdrawal, indecisiveness, body image change, work difficulty, insomnia, fatigability, loss of appetite, weight loss, somatic preoccupation, and loss of libido (Beck & Steer, 1993; Beck, Steer & Garbin, 1988).

The test is available at the link below (as long as the link stays alive).

[Beck's Depression Inventory Test](https://www.ismanet.org/doctoryourspirit/pdfs/Beck-Depression-Inventory-BDI.pdf)


The score is calculated by summing the answers to the questions. Each of the 21 items is scored from 0 to 3, so the total, which indicates the intensity of the depression, ranges from 0 to 63.
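As a minimal sketch of this scoring rule, assuming the standard BDI format in which each of the 21 items is answered on a 0 to 3 scale, the total is a plain sum. The `answers` list and the `bdi_total` helper are made up for illustration and do not come from the dataset.

```python
# Hypothetical answers to the 21 BDI items, each scored 0-3 (illustrative only)
answers = [0, 1, 2, 0, 1, 3, 0, 0, 1, 2, 1, 0, 0, 1, 2, 3, 1, 0, 0, 1, 2]


def bdi_total(item_scores):
    """Sum the 21 item scores after checking that they are valid."""
    if len(item_scores) != 21:
        raise ValueError("BDI expects exactly 21 item scores")
    if any(s not in (0, 1, 2, 3) for s in item_scores):
        raise ValueError("each item score must be 0, 1, 2, or 3")
    return sum(item_scores)


total = bdi_total(answers)
print(total)
```

The two checks make the 0 to 63 range explicit: 21 items times a maximum of 3 points each.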

In [None]:
# Packages needed for the analysis

import numpy as np # linear algebra
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_validate
import warnings
warnings.filterwarnings("ignore")
import os
path = None
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        path = os.path.join(dirname, filename)
        print(path)

# Read the data (the last file found under /kaggle/input)
data = pd.read_csv(path)

As you can see, the data consists of 26 inputs. We drop the first column since it is just an index of the test.

In [None]:
data=data.iloc[:,1:]
data.head(4)

Checking the data for missing values before modeling is essential. Hopefully there are no nulls in the dataframe.

In [None]:
data.describe()
data.isnull().sum()
#data = data.dropna()
#data = data.reset_index(drop=True)

To get the final BDI score, we sum the item columns, that is, all values from B1 to B21.

In [None]:
BDI = data.iloc[:,5:].copy()  # copy so the new column is not written to a slice of data
#SUM ALL BDI SCORES FOR BDI TOTAL SCORE
BDI["BDI TOTAL"] = BDI.sum(axis=1)

As seen below, we concatenate the pieces into a more meaningful dataframe. We now have 5 input features (Gender, Education, Working Status, Marriage Style, Having Kids) and one output, the BDI score. All inputs are categorical and the output is numeric, so the problem becomes a **regression task**.

However, we will see! I suspect the score could also be split into sub-groups, which would turn the problem into a classification task. Interesting...

In [None]:
#DROP BDI 1 - BDI 21 FROM DATASET AND CONCAT BDI TOTAL

dataset = data.iloc[:,:5]
BDI_SUM = BDI.iloc[:,-1:]
df=pd.concat([dataset, BDI_SUM], axis=1)
df.head(4)

In [None]:
df.isnull().sum()

This step is important: categorical values can be encoded. In our case they are already encoded as integers, but we will convert them to a one-hot encoding because it may help the regression model. I am still not sure; it has to be tested.

In [None]:
from sklearn import preprocessing
ohe= preprocessing.OneHotEncoder()

gender = df.iloc[:,:1]
gender_hot = ohe.fit_transform(gender).toarray()

education = df.iloc[:,1:2]
education_hot = ohe.fit_transform(education).toarray()

working = df.iloc[:,2:3]
working_hot = ohe.fit_transform(working).toarray()

marriage = df.iloc[:,3:4]
marriage_hot = ohe.fit_transform(marriage).toarray()

siblings = df.iloc[:,4:5]
siblings_hot = ohe.fit_transform(siblings).toarray()


In [None]:
dfgender =pd.DataFrame(data=gender_hot, index=range(gender.shape[0]), columns= ["female","male"])
dfeducation = pd.DataFrame(data=education_hot, index=range(education.shape[0]), columns= ["primary","high school","Bachelor","Msc or PhD"])
dfworking =pd.DataFrame(data=working_hot, index=range(working.shape[0]), columns= ["employed","unemployed"])
dfmarriage =pd.DataFrame(data=marriage_hot, index=range(marriage.shape[0]), columns= ["arranged","flirt"])
dfchild =pd.DataFrame(data=siblings_hot, index=range(siblings.shape[0]), columns= ["Have kids","Not Have kids"])

df=pd.concat([dfgender,dfeducation,dfworking,dfmarriage,dfchild],axis=1)
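The same kind of encoding can be sketched in a single call with `pd.get_dummies`, which derives the column names from the data instead of hard-coding the label order. The `toy` frame below, including its column names and integer codes, is hypothetical and only stands in for the real features.

```python
import pandas as pd

# Hypothetical toy data; the column names and codes are illustrative only
toy = pd.DataFrame({
    "gender": [0, 1, 1, 0],
    "education": [1, 2, 3, 4],
})

# Encode every listed column at once; dtype=int gives 0/1 instead of booleans
toy_hot = pd.get_dummies(toy, columns=["gender", "education"], dtype=int)
print(toy_hot.columns.tolist())
print(toy_hot)
```

Each row ends up with exactly one 1 per original feature, which is a quick sanity check for any one-hot encoding.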

In [None]:
x=pd.concat([dfgender,dfeducation,dfworking,dfmarriage,dfchild],axis=1)
y= BDI.iloc[:,-1:].copy()  # copy so columns can be added to y later without warnings
data = pd.concat([x,y],axis=1)

In [None]:
korelasyon=data.corr()
figure, axis=plt.subplots(figsize=(10,10))
sns.heatmap(korelasyon, annot=True)

In [None]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2, random_state=4)

# ML Models: Solving as a Regression Task

We will use linear regression, Decision Tree, Random Forest and Support Vector Regressor to build a model. 

**Linear Regression**

R² is negative, which means the model predicts worse than simply using the mean of the target. The inputs seem to carry little information about the output. In other words: garbage in, garbage out.

In [None]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(x_train,y_train)
y_pred = reg.predict(x_test)
r2_score(y_test, y_pred)
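To make the meaning of a negative score concrete: R² is defined as 1 - SS_res / SS_tot, so a model that always predicts the mean of the target scores 0 and anything worse than that goes below 0. The numbers in this small sketch are made up for illustration.

```python
from sklearn.metrics import r2_score

y_true = [1, 2, 3]      # made-up targets
mean_pred = [2, 2, 2]   # always predict the mean of y_true
bad_pred = [3, 2, 1]    # predictions that move in the wrong direction

r2_mean = r2_score(y_true, mean_pred)  # SS_res equals SS_tot here
r2_bad = r2_score(y_true, bad_pred)    # SS_res is larger than SS_tot here
print(r2_mean, r2_bad)
```

A negative value therefore does not have to indicate a bug; it says the fitted model is less useful than a constant guess on the test data.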

**Decision Tree**

Again, R² is negative, which means the model predicts worse than simply using the mean of the target. In other words: garbage in, garbage out.

In [None]:
from sklearn.tree import DecisionTreeRegressor
RTD = DecisionTreeRegressor(random_state = 0)
RTD.fit(x_train, y_train)
y_pred = RTD.predict(x_test)
r2_score(y_test, y_pred)

**Random Forest**

Again, R² is negative, which means the model predicts worse than simply using the mean of the target. In other words: garbage in, garbage out.

In [None]:
from sklearn.ensemble import RandomForestRegressor

RFR = RandomForestRegressor(n_estimators=10,random_state = 0 )
RFR.fit(x_train, y_train.values.ravel())  # the forest expects a 1-D target
y_pred=RFR.predict(x_test)
r2_score(y_test, y_pred)

**Support Vector Machines: Regressor**

In [None]:
from sklearn.svm import SVR

SVR_Reg = SVR(kernel = "rbf", C=40)  # degree is only used by the "poly" kernel, so it is dropped
SVR_Reg.fit(x_train, y_train.values.ravel())  # SVR expects a 1-D target
y_pred = SVR_Reg.predict(x_test)
r2_score(y_test, y_pred)

**Cross-Validation**

Again, R² is negative, which means the model predicts worse than simply using the mean of the target. In other words: garbage in, garbage out.

In [None]:
cv_sonuc= cross_validate(RFR, x, y, cv=5 , scoring=('r2', 'neg_mean_squared_error'))
res=cv_sonuc['test_r2'].mean()
print("R2 Score is: ", res*100, "%")
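One way to put such scores in context is to cross-validate a baseline that ignores the features entirely. The sketch below uses `DummyRegressor` from scikit-learn on synthetic data; `X_demo` and `y_demo` are made-up stand-ins, and in this notebook one would presumably pass `x` and `y` instead.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(0)
# Synthetic stand-in data: 200 rows, 5 binary features, target unrelated to them
X_demo = rng.integers(0, 2, size=(200, 5))
y_demo = rng.integers(0, 64, size=200).astype(float)

# The baseline always predicts the mean of the training fold
baseline = DummyRegressor(strategy="mean")
cv_base = cross_validate(baseline, X_demo, y_demo, cv=5, scoring="r2")
baseline_r2 = cv_base["test_score"].mean()
print("Baseline R2:", baseline_r2)
```

If a real model cannot beat this baseline, the features probably do not explain the target, which would be consistent with the negative scores seen here.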

**Any ideas why the score is low? Garbage data?** Next, I will try to turn the problem into a classification task; maybe that helps!

# ML Models: Solving as a Classification Task


Discretize the target value by binning.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score as ass
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_validate
from sklearn.metrics import make_scorer
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB

# Linear Discriminant Analysis libraries
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.covariance import LedoitWolf
from sklearn.covariance import MinCovDet
from sklearn.covariance import OAS
from sklearn.covariance import GraphicalLasso

Discretize by binning:

* 1-10: These ups and downs are considered normal
* 11-16: Mild mood disturbance
* 17-20: Borderline clinical depression
* 21-30: Moderate depression
* 31-40: Severe depression
* Over 40: Extreme depression

In [None]:
import warnings
warnings.filterwarnings("ignore")
bins = [-np.inf, 10, 16, 20, 30, 40, np.inf]
labels = ["normal", "mid-mood","borderline","moderate-depression","severe-depression","extreme-depression"]

y['binned BDI'] = pd.cut(y['BDI TOTAL'], bins = bins, labels = labels)
y.head(81)
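To check how the bin edges behave, the sketch below applies the same `bins` and `labels` to a handful of made-up scores that sit on each boundary. By default `pd.cut` uses right-inclusive intervals, so a score of 10 falls in the first bin and 11 in the second.

```python
import numpy as np
import pandas as pd

bins = [-np.inf, 10, 16, 20, 30, 40, np.inf]
labels = ["normal", "mid-mood", "borderline", "moderate-depression",
          "severe-depression", "extreme-depression"]

# Made-up scores sitting on each boundary (not taken from the dataset)
scores = pd.Series([0, 10, 11, 16, 17, 20, 21, 30, 31, 40, 41, 63])
binned = pd.cut(scores, bins=bins, labels=labels)

print(pd.DataFrame({"score": scores, "label": binned}))
# Class counts; on real data this shows how unbalanced the classes are
print(binned.value_counts(sort=False))
```

Running `value_counts` on the real binned target is worth doing before modeling, since a heavily unbalanced class distribution changes how an accuracy score should be read.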

In [None]:
y=y.iloc[:,-1:]
y.isnull().sum()

Models to compare:

* K Nearest Neighbors
* Logistic Regression
* Decision Tree
* Stochastic Gradient Descent
* Support Vector Classifier
* Random Forest
* Gaussian Naive Bayes
* Multinomial Naive Bayes
* Linear Discriminant Analysis

Although we do not examine each model in detail, keep in mind that every model has parameters that are not learned from the data and should be chosen by cross-validation. These are called hyperparameters. They can be tuned manually or with a grid search.
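A grid search could look roughly like the following sketch, which uses scikit-learn's `GridSearchCV` with a K Nearest Neighbors classifier. The data comes from `make_classification` and the values in `param_grid` are arbitrary choices for illustration, not tuned settings for the BDI data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the BDI dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     n_informative=4, random_state=0)

# Candidate hyperparameter values; every combination is cross-validated
param_grid = {
    "n_neighbors": [3, 5, 7, 9],
    "metric": ["euclidean", "manhattan", "chebyshev"],
}

search = GridSearchCV(KNeighborsClassifier(), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

With 4 values for `n_neighbors` and 3 metrics, the search evaluates 12 combinations, each with 5-fold cross-validation.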

In [None]:
k_nn=KNeighborsClassifier(n_neighbors=3, metric="chebyshev")
logi = LogisticRegression(random_state=5)
DT = DecisionTreeClassifier(max_features="sqrt")
SDF = SGDClassifier(penalty="l2", random_state=10)
S_VC= SVC(degree=3,C=8, kernel="rbf")
RF= RandomForestClassifier(n_estimators=78, criterion= "gini") # criterion = "gini" or "entropy"
Bayes=  GaussianNB()
MBayes = MultinomialNB()
BBayes = BernoulliNB()
LDA = LinearDiscriminantAnalysis(solver="eigen")    #solver= ‘svd’, ‘lsqr’, ‘eigen’
Result =[]

In [None]:
cv_sonuc= cross_validate(k_nn, x, y, cv=5 , scoring='accuracy')
res=cv_sonuc['test_score'].mean()
print("Accuracy of KNN: ", res*100, "%")
Result.append( "KNN :")
Result.append( res)

In [None]:
cv_sonuc= cross_validate(SDF, x, y, cv=5 , scoring='accuracy')
res=cv_sonuc['test_score'].mean()
print("Accuracy of SGD: ", res*100, "%")

# Result was initialized above; re-creating it here would discard the KNN score
Result.append( "SGD :")
Result.append( res)

In [None]:
cv_sonuc= cross_validate(logi, x, y, cv=5 , scoring='accuracy')
res=cv_sonuc['test_score'].mean()
print("Accuracy of Logistic Regression: ", res*100, "%")
Result.append( "LR :")
Result.append( res)

In [None]:
cv_sonuc= cross_validate(DT, x, y, cv=5 , scoring='accuracy')
res=cv_sonuc['test_score'].mean()

print("Accuracy of Decision Tree: ", res*100, "%")
Result.append( "DT :")
Result.append( res)

In [None]:
cv_sonuc= cross_validate(S_VC, x, y, cv=5 , scoring='accuracy')
res=cv_sonuc['test_score'].mean()
print("Accuracy of Support Vector Classifier: ", res*100, "%")
Result.append( "SVC :")
Result.append( res)

In [None]:
cv_sonuc= cross_validate(RF, x, y, cv=5 , scoring='accuracy')
res=cv_sonuc['test_score'].mean()
print("Accuracy of Random Forest: ", res*100, "%")
Result.append( "RF :")
Result.append( res)

In [None]:
cv_sonuc= cross_validate(Bayes, x, y, cv=5 , scoring='accuracy')
res=cv_sonuc['test_score'].mean()
print("Accuracy of Naive Bayes: ", res*100, "%")
Result.append( "NB :")
Result.append( res)

In [None]:
cv_sonuc= cross_validate(MBayes, x, y, cv=5 , scoring='accuracy')
res=cv_sonuc['test_score'].mean()
print("Accuracy of Multinomial Naive Bayes: ", res*100, "%")
Result.append( "MNB :")
Result.append( res)

In [None]:
Result
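The per-model cells above all repeat the same pattern, so the comparison could also be written as one loop that stores the scores in a dictionary. This is only a sketch on synthetic data: `X_demo`, `y_demo`, the `models` dictionary, and the added most-frequent-class baseline are illustrative assumptions, not part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the BDI dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     n_informative=4, random_state=0)

models = {
    "Baseline (most frequent)": DummyClassifier(strategy="most_frequent"),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(random_state=0),
}

# Mean 5-fold accuracy per model, keyed by model name
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    results[name] = scores.mean()

# Print from best to worst
for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.2f}")
```

Keeping a most-frequent-class baseline in the same table makes it easier to judge whether an accuracy figure reflects real signal or just the class distribution.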

# Results

The highest accuracy is 56%.

* SGD: 0.51
* KNN: 0.36
* Logistic Regression: 0.56
* Decision Tree: 0.43
* Support Vector Classifier: 0.51
* Random Forest: 0.49
* Naive Bayes - Gaussian: 0.13
* Naive Bayes - Multinomial: 0.56
* Linear Discriminant Analysis: NaN