## Project Scenario

You work in a university's health and wellness center.

Mental health is often severely neglected, and the consequences, such as depression and student self-harm, can be very serious.

You are determined to identify students at risk using data so you can help them as early as possible.

In this project, you will explore a dataset obtained from international and domestic students at a Japanese university.

1. Acquire data on mental health of foreign students in Japan (Part I)
2. Perform exploratory data analysis and test a few hypotheses (Part II)
3. Transform the data for machine learning (Part III)
4. Train a machine learning model based on several hypotheses (Part IV)

## Part I

### Step 1: Download the dataset and read the research publication
The dataset we are working with comes from the research of Nguyen et al. (2019), in which the authors collected 268 questionnaire responses on depression, acculturative stress, social connectedness, and help-seeking behaviour from a cohort of local and international students.

More details <a href = 'https://www.mdpi.com/2306-5729/4/3/124/htm'>here</a>.

Download the data <a href = 'https://www.mdpi.com/2306-5729/4/3/124/s1'>here</a> and unzip the file in your project folder.

We highly recommend reading Tables 1 and 2 in the publication to understand what the headers in your dataset mean; we will refer to them from time to time.

### Step 2: Import pandas
Let's import pandas to read the data unzipped from the file you downloaded.

In [14]:
import pandas as pd

### Step 3: Read CSV as DataFrame
Now that you've imported the library, go ahead and read the CSV as a DataFrame.

Make sure you have your variable alone in the last line of your code cell so you can preview your DataFrame.

In [15]:
df = pd.read_csv('~/Downloads/Project University Mental Health/data.csv')

In [16]:
df.head()

Unnamed: 0,inter_dom,Region,Gender,Academic,Age,Age_cate,Stay,Stay_Cate,Japanese,Japanese_cate,...,Friends_bi,Parents_bi,Relative_bi,Professional_bi,Phone_bi,Doctor_bi,religion_bi,Alone_bi,Others_bi,Internet_bi
0,Inter,SEA,Male,Grad,24.0,4.0,5.0,Long,3.0,Average,...,Yes,Yes,No,No,No,No,No,No,No,No
1,Inter,SEA,Male,Grad,28.0,5.0,1.0,Short,4.0,High,...,Yes,Yes,No,No,No,No,No,No,No,No
2,Inter,SEA,Male,Grad,25.0,4.0,6.0,Long,4.0,High,...,No,No,No,No,No,No,No,No,No,No
3,Inter,EA,Female,Grad,29.0,5.0,1.0,Short,2.0,Low,...,Yes,Yes,Yes,Yes,No,No,No,No,No,No
4,Inter,EA,Female,Grad,28.0,5.0,1.0,Short,1.0,Low,...,Yes,Yes,No,Yes,No,Yes,Yes,No,No,No


In [17]:
df.tail()

Unnamed: 0,inter_dom,Region,Gender,Academic,Age,Age_cate,Stay,Stay_Cate,Japanese,Japanese_cate,...,Friends_bi,Parents_bi,Relative_bi,Professional_bi,Phone_bi,Doctor_bi,religion_bi,Alone_bi,Others_bi,Internet_bi
281,,,,,,,,,,,...,222,,,,,,,,,
282,,,,,,,,,,,...,249,,,,,,,,,
283,,,,,,,,,,,...,203,,,,,,,,,
284,,,,,,,,,,,...,247,,,,,,,,,
285,,,,,,,,,,,...,223,,,,,,,,,


### Step 4: Investigate what's wrong with the CSV
Wait a minute - if you displayed your DataFrame, you might have noticed something. The DataFrame you just read has a lot of missing data at the end. What's going on?

Take a better look at your DataFrame, more specifically the last few rows. There are two ways to do it:
1. Open your file in Excel and take a look
2. Use the .tail method of your DataFrame, and look at the last 20 rows

In [18]:
df.tail(20)

Unnamed: 0,inter_dom,Region,Gender,Academic,Age,Age_cate,Stay,Stay_Cate,Japanese,Japanese_cate,...,Friends_bi,Parents_bi,Relative_bi,Professional_bi,Phone_bi,Doctor_bi,religion_bi,Alone_bi,Others_bi,Internet_bi
266,Dom,JAP,Male,Under,19.0,2.0,1.0,Short,5.0,High,...,Yes,Yes,Yes,Yes,Yes,Yes,No,No,No,No
267,Dom,JAP,Male,Under,20.0,2.0,2.0,Medium,5.0,High,...,Yes,No,No,No,No,No,No,Yes,No,No
268,,,,,,,,,,,...,,,,,,,,,,
269,,,,,,,,,,,...,128,137,66,61,30,46,19,65,21,45
270,,,,,,,,,,,...,140,131,202,207,238,222,249,203,247,223
271,,,,,,,,,,,...,,,,,,,,,,
272,,,,,,,,,,,...,128,137,66,61,30,46,19,65,21,45
273,,,,,,,,,,,...,140,131,202,207,238,222,249,203,247,223
274,,,,,,,,,,,...,,,,,,,,,,
275,,,,,,,,,,,...,123,,,,,,,,,
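
If you would rather count the suspicious rows than eyeball them, a boolean mask on an identifying column can help. The sketch below is only an illustration on a made-up toy DataFrame (the values and the name `toy` are assumptions, not the real data); it treats a row as residue when `inter_dom` is missing.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real file: three valid rows followed by two
# leftover rows that only populate the last column (invented values).
toy = pd.DataFrame({
    "inter_dom": ["Inter", "Inter", "Dom", np.nan, np.nan],
    "Age": [24.0, 28.0, 19.0, np.nan, np.nan],
    "Friends_bi": ["Yes", "No", "Yes", "128", "140"],
})

# A row is suspicious when the identifying column is missing.
junk_mask = toy["inter_dom"].isna()

# How many suspicious rows there are.
n_junk = int(junk_mask.sum())

# Index label of the first suspicious row.
first_junk = toy.index[junk_mask][0]

print(n_junk, first_junk)
```

On your own DataFrame, the same mask applied to `df` should tell you how many rows need to go and where they start.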


### Step 5: Remove the weird rows
When you check the data in Step 4, you will see stray values at the bottom of the file; they are what produce the many NaNs in the last rows.

Remove the last 18 rows, making sure that your resultant DataFrame has only <strong>268 rows x 50 columns</strong>.

There are many ways to do it, and here are some suggestions:
1. Slice the DataFrame (be careful about selecting the right index)
2. Drop NaN, using inter_dom as your subset reference column
3. Open your CSV in Excel and delete those rows directly (make sure you redo Step 3 before going to Step 6)
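
The first two suggestions can be compared side by side. This is a minimal sketch on an invented four-row DataFrame (the name `toy`, its values, and `n_valid = 2` are assumptions for illustration; the real dataset has 268 valid rows), showing that a positional slice and a `dropna` on `inter_dom` keep the same rows when all the stray rows sit at the bottom.

```python
import numpy as np
import pandas as pd

# Toy stand-in: two valid rows followed by two stray rows (invented values).
toy = pd.DataFrame({
    "inter_dom": ["Inter", "Dom", np.nan, np.nan],
    "Internet_bi": ["No", "Yes", "45", "223"],
})

n_valid = 2  # hypothetical count for the toy data

# Option 1: positional slice; iloc excludes the end position,
# so [:n_valid] keeps positions 0 .. n_valid - 1.
by_slice = toy.iloc[:n_valid]

# Option 2: drop every row whose inter_dom value is missing.
by_dropna = toy.dropna(subset=["inter_dom"])

# When the stray rows are all at the bottom, both options agree.
print(by_slice.shape, by_slice.equals(by_dropna))
```

The slice depends on knowing the exact row count, while `dropna` depends on `inter_dom` never being legitimately empty; checking `.shape` afterwards guards against either assumption failing.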

In [19]:
clean_df = df.dropna(subset=['inter_dom'])

clean_df

Unnamed: 0,inter_dom,Region,Gender,Academic,Age,Age_cate,Stay,Stay_Cate,Japanese,Japanese_cate,...,Friends_bi,Parents_bi,Relative_bi,Professional_bi,Phone_bi,Doctor_bi,religion_bi,Alone_bi,Others_bi,Internet_bi
0,Inter,SEA,Male,Grad,24.0,4.0,5.0,Long,3.0,Average,...,Yes,Yes,No,No,No,No,No,No,No,No
1,Inter,SEA,Male,Grad,28.0,5.0,1.0,Short,4.0,High,...,Yes,Yes,No,No,No,No,No,No,No,No
2,Inter,SEA,Male,Grad,25.0,4.0,6.0,Long,4.0,High,...,No,No,No,No,No,No,No,No,No,No
3,Inter,EA,Female,Grad,29.0,5.0,1.0,Short,2.0,Low,...,Yes,Yes,Yes,Yes,No,No,No,No,No,No
4,Inter,EA,Female,Grad,28.0,5.0,1.0,Short,1.0,Low,...,Yes,Yes,No,Yes,No,Yes,Yes,No,No,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
263,Dom,JAP,Female,Under,21.0,3.0,4.0,Long,5.0,High,...,Yes,Yes,No,No,No,No,No,No,No,Yes
264,Dom,JAP,Female,Under,22.0,3.0,3.0,Medium,3.0,Average,...,Yes,Yes,Yes,No,No,No,No,No,No,No
265,Dom,JAP,Female,Under,19.0,2.0,1.0,Short,5.0,High,...,Yes,Yes,Yes,Yes,Yes,Yes,No,No,No,No
266,Dom,JAP,Male,Under,19.0,2.0,1.0,Short,5.0,High,...,Yes,Yes,Yes,Yes,Yes,Yes,No,No,No,No


### Step 6: Find the total number of missing values in each column
Now that we have a cleaner DataFrame, let's sum up the null values in each column.

This is so that we can assess whether we need to clean the DataFrame some more.

In [20]:
clean_df.isna().sum()

inter_dom           0
Region              0
Gender              0
Academic            0
Age                 0
Age_cate            0
Stay                0
Stay_Cate           0
Japanese            0
Japanese_cate       0
English             0
English_cate        0
Intimate            8
Religion            0
Suicide             0
Dep                 0
DepType             0
ToDep               0
DepSev              0
ToSC                0
APD                 0
AHome               0
APH                 0
Afear               0
ACS                 0
AGuilt              0
AMiscell            0
ToAS                0
Partner             0
Friends             0
Parents             0
Relative            0
Profess             0
 Phone              0
Doctor              0
Reli                0
Alone               0
Others              0
Internet           26
Partner_bi          0
Friends_bi          0
Parents_bi          0
Relative_bi         0
Professional_bi     0
Phone_bi            0
Doctor_bi 
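
With 50 columns, the full listing is long and most entries are zero. One possible shortcut, sketched here on an invented toy DataFrame (the name `toy` and its values are assumptions), is to filter the result down to the columns that actually have gaps.

```python
import numpy as np
import pandas as pd

# Toy stand-in with one complete column and two columns with gaps.
toy = pd.DataFrame({
    "Age": [24.0, 28.0, 25.0, 29.0],
    "Intimate": ["Yes", np.nan, "No", "Yes"],
    "Internet": [1.0, np.nan, np.nan, 0.0],
})

# Missing-value count per column, as in the cell above.
missing = toy.isna().sum()

# Keep only columns with at least one missing value, largest first.
missing_only = missing[missing > 0].sort_values(ascending=False)

print(missing_only)
```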

### Step 7: Replace the missing values with median
Two columns have missing values: 'Intimate' (8) and 'Internet' (26). This step deals with 'Internet', where the gaps make up only around 10% of the total number of rows, so you can just go ahead and replace the NaN with the median.

Replace the missing values in 'Internet' column with the median of the column.
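
The "around 10%" figure can be checked with a line of arithmetic, using the 26 missing 'Internet' values from Step 6 and the 268 rows of the cleaned DataFrame. The second half of this sketch shows how `isna().mean()` would give the same kind of share directly; the `toy` Series is an invented example.

```python
import numpy as np
import pandas as pd

# Counts taken from the output above.
n_missing = 26   # missing 'Internet' values
n_rows = 268     # rows in the cleaned DataFrame

share = n_missing / n_rows
print(f"{share:.1%}")

# The same idea on a toy column: the mean of a boolean mask
# is the fraction of True values, i.e. the share of missing entries.
toy = pd.Series([1.0, np.nan, 3.0, 4.0], name="Internet")
toy_share = toy.isna().mean()
print(toy_share)
```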


In [21]:
median = clean_df['Internet'].median()

clean_df['Internet'] = clean_df['Internet'].fillna(median)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  clean_df['Internet'] = clean_df['Internet'].fillna(median)
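
The message above is pandas' SettingWithCopyWarning: pandas cannot tell whether `clean_df` is an independent DataFrame or still tied to `df`. One common way to avoid it, sketched here on an invented toy DataFrame (the names `raw` and `toy_clean` and their values are assumptions), is to take an explicit `.copy()` when filtering, so that later assignments clearly target a separate object.

```python
import numpy as np
import pandas as pd

# Toy stand-in: four valid rows and one stray row (invented values).
raw = pd.DataFrame({
    "inter_dom": ["Inter", "Inter", "Dom", "Dom", np.nan],
    "Internet": [1.0, np.nan, 3.0, 5.0, np.nan],
})

# .copy() makes the filtered result an independent DataFrame.
toy_clean = raw.dropna(subset=["inter_dom"]).copy()

# Series.median() skips NaN by default, so it uses 1.0, 3.0 and 5.0 here.
median = toy_clean["Internet"].median()

# Fill the gaps and assign the result back to the column.
toy_clean["Internet"] = toy_clean["Internet"].fillna(median)

print(toy_clean)
```

The original `raw` DataFrame is left untouched, which makes it easy to compare before and after.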


### Step 8: Check missing values again
Just repeat Step 6 to see if your replacement worked.

In [22]:
clean_df.isna().sum()

inter_dom          0
Region             0
Gender             0
Academic           0
Age                0
Age_cate           0
Stay               0
Stay_Cate          0
Japanese           0
Japanese_cate      0
English            0
English_cate       0
Intimate           8
Religion           0
Suicide            0
Dep                0
DepType            0
ToDep              0
DepSev             0
ToSC               0
APD                0
AHome              0
APH                0
Afear              0
ACS                0
AGuilt             0
AMiscell           0
ToAS               0
Partner            0
Friends            0
Parents            0
Relative           0
Profess            0
 Phone             0
Doctor             0
Reli               0
Alone              0
Others             0
Internet           0
Partner_bi         0
Friends_bi         0
Parents_bi         0
Relative_bi        0
Professional_bi    0
Phone_bi           0
Doctor_bi          0
religion_bi        0
Alone_bi     

### Step 9: Export cleaned DataFrame as CSV
Now that you've done some cleaning up and filled in missing values, it's time to export the DataFrame as a CSV.

In [23]:
clean_df.to_csv('~/Downloads/Project University Mental Health/clean_data.csv')
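
One thing to be aware of when exporting: by default `to_csv` also writes the DataFrame's index as an unnamed first column, which is how an `Unnamed: 0` column appears when the file is read back. The sketch below illustrates the difference on an invented two-row DataFrame, writing to in-memory buffers instead of real files; whether you want `index=False` for this project is a judgment call.

```python
import io

import pandas as pd

# Toy stand-in for the cleaned data (invented values).
toy = pd.DataFrame({"inter_dom": ["Inter", "Dom"], "Age": [24.0, 19.0]})

# Default behaviour: the index is written as an extra, unnamed column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False writes only the data columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```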