# Basic Process

- Obtaining and loading data (e.g. from CSV)
- Exploring the data (e.g. data set size, class balance)
- Preparing the data (e.g. split train and test data, text vectorisation)
- Model fitting (e.g. a pipeline with TF-IDF features and multinomial naive Bayes)
- Model evaluation (e.g. accuracy and confusion matrix)
- Model application (e.g. a single prediction on new text)

### 1 Loading Data

In [None]:
import pandas as pd

df = pd.read_csv("data/mental_health.csv")
df.head()

### 2 Exploratory Data Analysis (EDA)

In [None]:
# Column types, non-null counts and data set size
df.info()
# Number of samples per class (class balance)
df["label"].value_counts()
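
Raw counts can be hard to compare across classes, so relative frequencies and a rough text length per class are often useful as well. The following is a minimal sketch on a hypothetical toy frame (`toy` and its contents are made up for illustration; only the column names `text` and `label` come from the notebook, and the 0/1 label values are an assumption).

```python
import pandas as pd

# Hypothetical stand-in for the real data set; column names follow the notebook.
toy = pd.DataFrame({
    "text": ["feel fine today", "cannot sleep anymore",
             "great day out", "nothing matters"],
    "label": [0, 1, 0, 1],
})

# Relative class frequencies instead of raw counts
class_share = toy["label"].value_counts(normalize=True)
print(class_share)

# Rough text length per class (number of whitespace-separated tokens)
toy["n_tokens"] = toy["text"].str.split().str.len()
tokens_per_class = toy.groupby("label")["n_tokens"].mean()
print(tokens_per_class)
```

The same two calls should work unchanged on `df`, provided it has the `text` and `label` columns used elsewhere in this notebook.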

### 3 Data Preparation

In [None]:
from sklearn.model_selection import train_test_split

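# Features (raw text) and target (class label)
X = df["text"]
y = df["label"]

# Hold out 20% of the samples for testing; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=17)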
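
If the classes turn out to be imbalanced, a stratified split may be worth considering, since it keeps the class ratio the same in the training and test partitions. This is a sketch on hypothetical toy data (`texts` and `labels` are invented here); the notebook itself uses a plain random split.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 80 samples of class 0, 20 of class 1
texts = [f"sample {i}" for i in range(100)]
labels = [0] * 80 + [1] * 20

# stratify=labels keeps the 80/20 class ratio in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=17, stratify=labels)

print(Counter(y_tr))
print(Counter(y_te))
```

Applied to the notebook's data, this would amount to passing `stratify=y` to the existing `train_test_split` call.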

### 4 Model Fitting

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Vectorise the text with TF-IDF, then classify with multinomial naive Bayes
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])
pipe.fit(X_train, y_train)

### 5 Evaluation

In [None]:
from sklearn import metrics

# Predict labels for the held-out test set
y_predicted = pipe.predict(X_test)
accuracy = metrics.accuracy_score(y_true=y_test, y_pred=y_predicted)
# Rows are true classes, columns are predicted classes
confusion = metrics.confusion_matrix(y_true=y_test, y_pred=y_predicted)
print(accuracy)
print(confusion)
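
Accuracy alone can be misleading when one class dominates, so per-class precision and recall are a common complement to the confusion matrix. The sketch below uses hypothetical label lists (`y_true_toy`, `y_pred_toy`) in place of `y_test` and `y_predicted`, and assumes binary 0/1 labels with 1 as the positive class.

```python
from sklearn import metrics

# Hypothetical labels, standing in for y_test and y_predicted
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1, 1, 0, 0]

# Rows of the confusion matrix are true classes, columns are predicted classes
confusion_toy = metrics.confusion_matrix(y_true=y_true_toy, y_pred=y_pred_toy)
print(confusion_toy)

# Precision and recall for the positive class (label 1 by default)
precision = metrics.precision_score(y_true=y_true_toy, y_pred=y_pred_toy)
recall = metrics.recall_score(y_true=y_true_toy, y_pred=y_pred_toy)
print(precision, recall)

# Text summary with precision, recall and F1 for every class
print(metrics.classification_report(y_true=y_true_toy, y_pred=y_pred_toy))
```

For a screening-style task, recall on the positive class is often the number to watch, because a false negative is a missed case.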

### 6 Application

In [None]:
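# Classify a single, already preprocessed text; predict expects a list of documents
prediction = pipe.predict(['''
    nothing look forward lifei dont many 
    reasons keep going feel like nothing 
    keeps going next day makes want hang myself
    '''])
print(prediction)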
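
A bare label hides how confident the model is. Since `MultinomialNB` provides `predict_proba`, a pipeline ending in it can report class probabilities as well. The following is a minimal sketch with a hypothetical toy pipeline (`toy_pipe`, `toy_texts` and `toy_labels` are invented for illustration and are not the notebook's data).

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data; labels 0/1 are an assumption
toy_texts = ["happy sunny walk", "enjoyed lovely dinner",
             "hopeless tired empty", "empty hopeless alone"]
toy_labels = [0, 0, 1, 1]

toy_pipe = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])
toy_pipe.fit(toy_texts, toy_labels)

# One row per input document, one column per class (ordered as in classes_)
proba = toy_pipe.predict_proba(["feeling empty and hopeless"])
print(toy_pipe.classes_)
print(proba)
```

Naive Bayes probabilities tend to be poorly calibrated, so they are probably better read as a ranking signal than as literal likelihoods.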