In this notebook I perform exploratory data analysis (EDA) and binary classification with the CatBoost classifier.

**EDA:**

* Checking how categorical features relate to the target using swarm plots.
* Distribution plot for age.
* Plot of the top countries that participated in the survey.

**Classification:**

Classification using the CatBoost classifier.

Why CatBoost?
* It handles categorical features natively, so we don't need to encode them ourselves.
* It can handle class imbalance through class weights.

**Let's import all required packages**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier

seed = 143
np.random.seed(seed)

%matplotlib inline

Load the data

In [None]:
df = pd.read_csv('../input/autism-screening-on-adults/autism_screening.csv')

In [None]:
len(df)

Glimpse the data

In [None]:
df.head()

In [None]:
df.columns

## EDA

In [None]:
df.nunique()

Looking at the number of unique values per column, we can say that almost all columns are categorical. The exceptions are age and result, which are numeric.

Since column 'age_desc' has just one value throughout the data, we can drop it.

In [None]:
df.drop(columns=['age_desc'], inplace=True)
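
The same check can be done programmatically instead of reading the `nunique()` output by eye. Below is a minimal sketch on a toy frame; the helper name `constant_columns` and the toy values are assumptions for illustration, not part of the dataset.

```python
import pandas as pd

def constant_columns(frame: pd.DataFrame) -> list:
    """Return the names of columns that hold a single distinct value."""
    # dropna=False counts NaN as a value, so an all-NaN column is flagged too
    return [col for col in frame.columns if frame[col].nunique(dropna=False) == 1]

# Toy frame for illustration only (not the survey data)
toy = pd.DataFrame({
    "age_desc": ["same"] * 4,
    "gender": ["m", "f", "f", "m"],
    "result": [3, 7, 9, 2],
})

to_drop = constant_columns(toy)
print(to_drop)
```

On the real data, the returned list could be passed straight to `df.drop(columns=...)`.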

Let's check for null values in data

In [None]:
df.isnull().sum()

The age column has 2 NaN values; let's replace them with 0.

In [None]:
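# Assign the result back: calling fillna(inplace=True) on a selected column is deprecated in recent pandas
df['age'] = df['age'].fillna(0)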
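
This notebook fills the missing ages with 0. Since 0 is not a plausible age for a survey respondent, one possible alternative is to fill with the median of the observed ages. The sketch below shows that variation on toy values; it is an assumption about what might work better, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy ages for illustration only (not the survey data)
toy = pd.DataFrame({"age": [22.0, np.nan, 35.0, 41.0, np.nan]})

# median() skips NaN by default, so this is the median of 22, 35 and 41
median_age = toy["age"].median()

# Fill the gaps with the median instead of 0
filled = toy["age"].fillna(median_age)
print(filled.tolist())
```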

Let's encode the target class labels YES and NO as 1 and 0 respectively.

In [None]:
df['Class/ASD'] = df['Class/ASD'].apply(lambda x: int(x == 'YES'))

Let's plot individual columns against the target to see how they are related.

In [None]:
sns.swarmplot(x='Class/ASD', y='result', data=df)

The plot above shows that result is highly correlated with Class/ASD: there is a clear separation between classes 0 and 1.

Observations:

if 
* result > 6 , Class/ASD = 1
* otherwise, Class/ASD = 0
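
A rule read off a plot is worth checking numerically. Here is a minimal sketch of such a check, using a toy frame with the same two column names; the values are made up for illustration, so the agreement printed here says nothing about the real data.

```python
import pandas as pd

# Toy data for illustration only (not the survey data)
toy = pd.DataFrame({
    "result": [2, 6, 7, 10, 4, 8],
    "Class/ASD": [0, 0, 1, 1, 0, 1],
})

# Apply the rule: predict 1 when result > 6, otherwise 0
rule_pred = (toy["result"] > 6).astype(int)

# Fraction of rows where the rule matches the label
agreement = (rule_pred == toy["Class/ASD"]).mean()

# Cross-tabulate rule output against the label to see any mismatches
table = pd.crosstab(rule_pred, toy["Class/ASD"],
                    rownames=["result > 6"], colnames=["Class/ASD"])
print(agreement)
print(table)
```

Running the same three steps on `df` would show whether the threshold holds for every row or only for most of them.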


In [None]:
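# distplot is deprecated in recent seaborn releases; histplot with kde=True is the closest replacement
sns.histplot(df['age'], kde=True)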

From the plot above, the average age of respondents is somewhere around 25-30.

In [None]:
cols = ['gender','jundice','austim','used_app_before']
for i in cols:
    sns.swarmplot(x='Class/ASD', y=i, data=df)
    plt.show()

From the plots above, none of these columns shows a clear relationship with the target class. There is no clear separation of classes.

Let's plot the top 15 countries that participated in the survey, ranked by respondent count.

In [None]:
# Count respondents per country once and keep the 15 most frequent
top_countries = df['contry_of_res'].value_counts().head(15)
ax = sns.barplot(x=top_countries.values, y=top_countries.index)
ax.set(xlabel='Count', ylabel='Country')
plt.show()

## Prediction

In [None]:
df['Class/ASD'].value_counts()

There is a class imbalance problem. We will handle this later by specifying class_weights in the classifier.
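
The classifier below uses fixed weights of `[1, 2]`. As a hedged alternative, weights can be derived from the label counts with the common "balanced" heuristic, `n_samples / (n_classes * class_count)`. The helper name `balanced_class_weights` and the toy labels are assumptions for illustration.

```python
import pandas as pd

def balanced_class_weights(labels: pd.Series) -> list:
    """Weight each class inversely to its frequency, ordered by class label."""
    counts = labels.value_counts().sort_index()
    n_samples, n_classes = len(labels), len(counts)
    return [float(n_samples / (n_classes * count)) for count in counts]

# Toy labels for illustration only: six negatives, two positives
toy_labels = pd.Series([0] * 6 + [1] * 2)

weights = balanced_class_weights(toy_labels)
print(weights)
```

The resulting list is ordered by class label (0, then 1), which is the same order as the `[1, 2]` list used in this notebook.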

Before jumping into building a classifier, let's shuffle the data and split it into train and test sets.

In [None]:
df = df.sample(frac=1, random_state=seed)
train_x,test_x,train_y,test_y = train_test_split(df.drop(columns=['Class/ASD']), df['Class/ASD'], test_size=0.2, random_state=seed)
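
With imbalanced classes, a plain random split can leave the test set with a different class ratio than the training set. One possible variation, not used in this notebook, is to pass `stratify` to `train_test_split`. A minimal sketch on toy data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data for illustration only: 15 negatives and 5 positives
toy = pd.DataFrame({"result": range(20), "Class/ASD": [0] * 15 + [1] * 5})
X = toy.drop(columns=["Class/ASD"])
y = toy["Class/ASD"]

# stratify=y keeps the 3:1 class ratio in both the train and test parts
tr_x, te_x, tr_y, te_y = train_test_split(
    X, y, test_size=0.2, random_state=143, stratify=y
)
print(tr_y.value_counts().to_dict(), te_y.value_counts().to_dict())
```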

Let's build the classifier.

In [None]:
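# Treat every column except the target and the numeric age/result columns as categorical
cat_features = [i for i in df.columns if i not in ['Class/ASD', 'age', 'result']]

clf = CatBoostClassifier(
    iterations=10,
    verbose=5,             # log every 5th iteration
    class_weights=[1, 2],  # weight class 1 twice as much as class 0
)

clf.fit(
    train_x, train_y,
    cat_features=cat_features,
)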

In [None]:
pred_y = clf.predict(test_x)

In [None]:
# scikit-learn metrics take (y_true, y_pred) in that order
accuracy_score(test_y, pred_y)

In [None]:
# scikit-learn metrics take (y_true, y_pred) in that order
f1_score(test_y, pred_y)

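A perfect F1 score of 1! This is less surprising than it looks: as observed in the EDA, result > 6 separates the two classes, so the model only has to learn that one threshold.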

Let's find the feature importance.

In [None]:
clf.get_feature_importance()
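
`get_feature_importance()` returns one value per feature, in the order of the training columns, so the raw array is hard to read on its own. Below is a minimal sketch that pairs the values with column names and sorts them. The helper name `rank_importances` and the numbers are made up for illustration; they are not results from this model.

```python
import pandas as pd

def rank_importances(feature_names, importances) -> pd.Series:
    """Pair importances with feature names and sort from most to least important."""
    paired = pd.Series(list(importances), index=list(feature_names))
    return paired.sort_values(ascending=False)

# Made-up values, only to show the output shape. With the fitted model one would call:
# rank_importances(train_x.columns, clf.get_feature_importance())
ranked = rank_importances(["age", "result", "gender"], [7.5, 90.0, 2.5])
print(ranked)
```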