## 1. About the Data

This dataset contains information about international students in Australia: student numbers, enrolments, nationality, field of study, and so on.

Student numbers are available from 2002 to October 2020. The data on enrolments, commencements and field of study, however, is available only for 2019 and 2020 (up to October).


The data is available in two CSV files:

1. studentsPublic.csv: provides the actual student numbers in Australia by nationality and state.
2. nationalitySummary.csv: provides details about field of study, total enrolments and commencements.

In [None]:
## Read the CSV files.
import pandas as pd
df1 = pd.read_csv('../../data/studentPublic.csv')
df2 = pd.read_csv('../../data/nationalitySummary.csv')

In [None]:
## Let's examine the columns of both files.

all_cols = [list(df1.columns), list(df2.columns)]
all_cols

In [None]:
## Get the intersection of the column names, i.e. the columns common to both dataframes.

common_cols = set(df1.columns).intersection(df2.columns)
common_cols

### It is clear from the above output that the two datasets provide largely different attributes related to international students in Australia. The only columns common to both datasets are Month, Nationality and Year, so it is more appropriate to clean and analyse the datasets separately.
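
The same check can also be written with pandas' own `Index.intersection`. The sketch below uses two small made-up frames rather than the real files: only Month, Nationality and Year come from the text above, and the remaining column names are placeholders, not the actual schema.

```python
import pandas as pd

# Toy frames standing in for df1 and df2. Only Month, Nationality and Year
# are taken from the notebook; the other column names are hypothetical.
toy_students = pd.DataFrame(columns=["Year", "Month", "Nationality", "State", "Students"])
toy_enrol = pd.DataFrame(columns=["Year", "Month", "Nationality", "Sector", "Enrolments"])

# Index.intersection returns the labels present in both column indexes.
common_cols = toy_students.columns.intersection(toy_enrol.columns)
print(sorted(common_cols))
```

If the two files had to be combined later, these shared columns would be the natural join keys.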

## 2. Inspecting and Cleaning the Student numbers data (studentsPublic.csv)

In [None]:
## Import the required modules

import pandas as pd
import numpy as np
from functools import reduce
pd.set_option('display.max_columns', None)

In [None]:
## Read the csv file and load into a dataframe.

df_students = pd.read_csv('../../data/studentPublic.csv')

In [None]:
## check the type

type(df_students)

In [None]:
## Check the number of rows and columns in the student dataset.

df_students.shape

In [None]:
## Get some information about the dataframe created earlier.

df_students.info()

In [None]:
## Check the list of columns present in the dataset.

df_students.columns

In [None]:
## Normalise the column names to lowercase.
df_students.columns = [col.lower() for col in df_students.columns]
df_students.columns

In [None]:
## Get the first few rows at the beginning of the data set.

df_students.head(10)

In [None]:
## Get the last few rows of the data set.

df_students.tail(10)

In [None]:
## Get the count of missing values for all the columns

df_students.isna().sum()

We can see that no column has missing values except the last one, 'Growth'. This was mostly expected, as the dataset published on the Department of Education website is already pre-processed.
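
Raw counts are easier to judge next to the share of rows they represent. A minimal sketch on made-up data (the values below are illustrative, not taken from the real file):

```python
import numpy as np
import pandas as pd

# Toy frame with a partly missing 'growth' column; values are made up.
toy = pd.DataFrame({
    "year": [2019, 2020, 2020, 2020],
    "students": [10, 12, 5, 7],
    "growth": [np.nan, 0.2, np.nan, 0.4],
})

# isna() gives a boolean frame; the mean of booleans is the fraction of True values.
missing_share = toy.isna().mean()
print(missing_share)
```

On the real data, `df_students.isna().mean()` would give the same per-column fraction.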

In [None]:
## A possible reason for the missing 'growth' values is that there were no students in the previous year.
## Let's check this by looking at the rows that contain missing values.

null_data = df_students[df_students.isnull().any(axis=1)]
null_data

In [None]:
## Get the data for one country (say Chad) where growth column is null/missing.

null_data = null_data[(null_data.nationality == 'Chad') & (null_data.state == '_All')]
null_data.sort_values(by = 'year')

## Dealing with missing values
It can be seen from the above output that the growth column is null for a year when there is no data for the previous year.
Hence it is reasonable to fill the missing values in the growth column with 0.
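
To illustrate why the first available year ends up without a growth value, here is a small sketch. It assumes that growth is a year-over-year percentage change per nationality; the source file does not document how the column was computed, so this is an assumption, and the data is made up.

```python
import pandas as pd

# Made-up student counts for two hypothetical nationalities.
toy = pd.DataFrame({
    "nationality": ["A", "A", "A", "B", "B"],
    "year": [2018, 2019, 2020, 2019, 2020],
    "students": [100, 110, 121, 4, 6],
})

# Assumption: growth = change relative to the previous year, per nationality.
toy = toy.sort_values(["nationality", "year"])
toy["growth"] = toy.groupby("nationality")["students"].pct_change()

# The first year of each nationality has no previous year, so growth is NaN there.
toy["growth_filled"] = toy["growth"].fillna(0)
print(toy)
```

After filling, a 0 can no longer be told apart from a genuinely flat year, so it can be worth keeping the unfilled column alongside the filled one.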

In [None]:
## Filling the missing values with 0

df_students = df_students.fillna(0)

In [None]:
## Get the count of missing values for all the columns after the 'fillna' operation

df_students.isna().sum()

### Great, there are no missing values in the student dataset. Let us export the clean data into a new csv (studentsPublic_Clean.csv)

In [None]:
len(df_students)

In [None]:
df_students.to_csv('../../data/studentPublic_Clean.csv', index=False)

In [None]:
## Read the cleaned csv file and load into a new dataframe.

df_students_clean = pd.read_csv('../../data/studentPublic_Clean.csv')

In [None]:
dups = df_students_clean.duplicated()
dups.sum()
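
As a reminder of what this count means: `duplicated()` flags every row that is identical to an earlier row, so the sum is the number of surplus copies. A short sketch on made-up data, including how such rows could be dropped if any were found:

```python
import pandas as pd

# Made-up rows; the second row is an exact copy of the first.
toy = pd.DataFrame({
    "nationality": ["A", "A", "B"],
    "year": [2020, 2020, 2020],
    "students": [5, 5, 7],
})

# Number of rows that repeat an earlier row in every column.
n_dups = int(toy.duplicated().sum())

# drop_duplicates keeps the first occurrence of each distinct row.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_dups, len(deduped))
```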

In [None]:
df_students_clean.info()

## 3. Inspecting and Cleaning the Enrolments data (nationalitySummary.csv)

In [None]:
## Read the csv file and load into a dataframe.

df_enrol = pd.read_csv('../../data/nationalitySummary.csv')

In [None]:
## check the type

type(df_enrol)

In [None]:
## Check the number of rows and columns in the enrolments dataset.

df_enrol.shape

In [None]:
## Get some information about the dataframe created earlier.

df_enrol.info()

In [None]:
## Check the list of columns present in the dataset.

df_enrol.columns

In [None]:
## Normalise the column names to lowercase.
df_enrol.columns = [col.lower() for col in df_enrol.columns]
df_enrol.columns

In [None]:
## Get the first few rows at the beginning of the data set.

df_enrol.head(10)

In [None]:
## Get the last few rows of the data set.

df_enrol.tail(10)

In [None]:
## Get the count of missing values for all the columns

df_enrol.isna().sum()

In [None]:
## Let's look at the rows that contain missing values.

null_data1 = df_enrol[df_enrol.isnull().any(axis=1)]
null_data1

In [None]:
## Get the data for one country (say Niger) where enrolmentsgrowth column is null/missing.

null_data1 = null_data1[(null_data1.nationality == 'Niger')]
null_data1.sort_values(by = 'year')

## Dealing with missing values

It can be seen from the above output that the enrolmentsgrowth column is null for a year when there is no data for the previous year. For example, 'Non AQF Award' has no enrolments in 2019, so the value is null for 2020.
Hence it is reasonable to fill the missing values with 0.
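
A blanket `fillna(0)` also replaces missing values in any other column. As a hedged alternative, the fill can be restricted to chosen columns by passing a dictionary. The sketch below uses made-up rows; `enrolments` and `note` are hypothetical column names, and only `enrolmentsgrowth` is taken from the notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows; 'enrolments' and 'note' are hypothetical columns.
toy = pd.DataFrame({
    "nationality": ["Niger", "Niger"],
    "year": [2019, 2020],
    "enrolments": [0, 3],
    "enrolmentsgrowth": [np.nan, np.nan],
    "note": [None, "x"],
})

# A dict maps column name -> fill value; columns not listed are left untouched.
filled = toy.fillna({"enrolmentsgrowth": 0})
print(filled.isna().sum())
```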

In [None]:
## Filling the missing values with 0

df_enrol = df_enrol.fillna(0)

In [None]:
## Get the count of missing values for all the columns after the 'fillna' operation

df_enrol.isna().sum()

### Great, there are no missing values in the enrolments dataset. Let us export the clean data into a new csv (nationalitySummary_Clean.csv)

In [None]:
len(df_enrol)

In [None]:
df_enrol.to_csv('../../data/nationalitySummary_Clean.csv', index=False)

In [None]:
## Read the cleaned csv file and load into a new dataframe.

df_enrol_clean = pd.read_csv('../../data/nationalitySummary_Clean.csv')

In [None]:
dups = df_enrol_clean.duplicated()
dups.sum()

In [None]:
df_enrol_clean.info()

In [None]:
df_enrol_clean.isna().sum()