#### CSCE 676 :: Data Mining and Analysis :: Fall 2019


# Basic Data Exploration

*Notebook overview:* This notebook shows off some basic data exploration skills using NumPy, Matplotlib, Seaborn and Pandas.

## Load Data Sets
In Python, it is easy to load data from many sources and formats (TXT, CSV, JSON, XLS) using existing packages. Our data is in CSV format, so we read it with Pandas, which provides a number of functions for importing tabular data as a [DataFrame](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) object.

As an example, load recent-grads.csv and store it in a DataFrame object.

In [None]:
import pandas as pd
data_path = './data/recent-grads.csv'

# read the CSV file into a DataFrame object using Pandas
df = pd.read_csv(data_path)
# print the names of all variables (columns)
print(list(df.columns.values))
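
After loading, a quick look at the first rows, the column types, and the number of missing values can catch parsing problems early. The sketch below uses a tiny in-memory CSV so that it runs without the data file. The column names mirror ones used in this notebook, but the rows are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: made-up rows, one missing Total
csv_text = """Major,Total,Full_time
Major A,1000,800
Major B,500,300
Major C,,120
"""

sample_df = pd.read_csv(io.StringIO(csv_text))

# first rows, to confirm the file parsed as expected
print(sample_df.head())
# column types: a numeric column with a missing entry is read as float64
print(sample_df.dtypes)
# number of missing values per column
print(sample_df.isna().sum())
```

The same three calls can be applied to `df` once the real file is loaded.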

## Calculate Basic Statistics
So far, the data has been successfully loaded into memory. It is a good time to calculate some basic statistics for some variables in this dataset, e.g., mean, median, standard deviation, max, and min, using [NumPy](http://www.numpy.org).

First, let's look at how many samples are in this dataset.

In [None]:
numObv, numVar = df.shape
print("This dataset contains {0} samples with {1} variables.".format(numObv, numVar))

Next, let's calculate the mean of some selected variables in the dataset.

In [None]:
import numpy as np

# calculate the mean of the total number of students across all majors
df_total = df['Total']
mean_total = int(round(np.mean(df_total)))
print("The mean of the total number of students across all majors is {0}.".format(mean_total))

# calculate the mean of the total number of full-time students across all majors
df_fulltime = df['Full_time']
mean_fulltime = int(round(np.mean(df_fulltime)))
print("The mean of the total number of full-time students across all majors is {0}.".format(mean_fulltime))

# calculate the ratio of full-time students
ratio_fulltime = float(mean_fulltime)/mean_total
print("The ratio of full-time students is {0:0.2f}.".format(ratio_fulltime))

Now let's calculate the median of the variables selected above.

In [None]:
# calculate the median of the total number of students across all majors
median_total = int(round(np.median(df_total)))
print("The median of the total number of students across all majors is {0}.".format(median_total))

# calculate the median of the total number of full-time students across all majors
median_fulltime = int(round(np.median(df_fulltime)))
print("The median of the total number of full-time students across all majors is {0}.".format(median_fulltime))
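
One thing to watch for: if a column contains missing values, `np.median` on the raw values returns `nan`, while `np.nanmedian` and the pandas `Series.median` method skip them. Whether this affects `Total` or `Full_time` depends on the data file, so the sketch below uses made-up values with one missing entry.

```python
import numpy as np
import pandas as pd

# Made-up values with one missing entry (not taken from the dataset)
values = pd.Series([100.0, 300.0, np.nan, 500.0])

# the missing value propagates, so the result is nan
median_plain = np.median(values.to_numpy())
# NaN-aware variant: ignores the missing value
median_nan_aware = np.nanmedian(values.to_numpy())
# pandas skips NaN by default
median_pandas = values.median()

print(median_plain, median_nan_aware, median_pandas)
```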

Following a similar routine, we can easily calculate other basic statistics for these variables. All of these statistics can be visualized using [Matplotlib](http://matplotlib.org), introduced later in this notebook.

In [None]:
# calculate the standard deviation of the total number of students across all majors
std_total = np.std(df_total)
print("The standard deviation of the total number of students across all majors is {0:0.2f}.".format(std_total))

# calculate the standard deviation of the total number of full-time students across all majors
std_fulltime = np.std(df_fulltime)
print("The standard deviation of the total number of full-time students across all majors is {0:0.2f}.".format(std_fulltime))
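
Note that "standard deviation" comes in two flavors: the population version divides by n, and the sample version divides by n - 1. `np.std` defaults to the population version (`ddof=0`), whereas the pandas `Series.std` method defaults to the sample version (`ddof=1`). A minimal sketch with made-up values shows the difference:

```python
import numpy as np
import pandas as pd

# Made-up values, chosen so the population standard deviation is a round number
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

pop_std = np.std(values)               # divides by n (ddof=0)
sample_std = np.std(values, ddof=1)    # divides by n - 1
pandas_std = pd.Series(values).std()   # pandas defaults to ddof=1

print("population: {0:0.4f}, sample: {1:0.4f}, pandas default: {2:0.4f}".format(
    pop_std, sample_std, pandas_std))
```

For a dataset with many rows the two values are close, but it is worth stating which one is being reported.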

In [None]:
df_major = df['Major']

# calculate the max of the total number of students across all majors
max_total = np.max(df_total)
# np.argmax gives a position (in current pandas), so look up the major with .iloc
max_idx_total = np.argmax(df_total)
max_major = df_major.iloc[max_idx_total]
print("The max of the total number of students across all majors is {0} in the major {1}.".format(max_total, max_major))

# calculate the min of the total number of students across all majors
min_total = np.min(df_total)
# np.argmin gives a position (in current pandas), so look up the major with .iloc
min_idx_total = np.argmin(df_total)
min_major = df_major.iloc[min_idx_total]
print("The min of the total number of students across all majors is {0} in the major {1}.".format(min_total, min_major))
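
An alternative that stays within pandas is `idxmax` / `idxmin`, which return the index label of the extreme value and can be passed to `.loc` to retrieve the whole row. The sketch below uses a hypothetical three-row DataFrame rather than the real data.

```python
import pandas as pd

# Hypothetical majors and totals, for illustration only
toy = pd.DataFrame({
    "Major": ["Major A", "Major B", "Major C"],
    "Total": [1200, 300, 4500],
})

# idxmax/idxmin return index labels, which .loc accepts directly
largest = toy.loc[toy["Total"].idxmax()]
smallest = toy.loc[toy["Total"].idxmin()]

print("Largest: {0} ({1})".format(largest["Major"], largest["Total"]))
print("Smallest: {0} ({1})".format(smallest["Major"], smallest["Total"]))
```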

## Data Visualization
Data visualization helps make data easier to understand. Python has libraries such as Matplotlib and Seaborn for creating many kinds of graphs. Let's look at some visualizations to understand a few variables in the dataset.

### Histogram
Let's check the distribution of the ratio of female students across all majors by plotting a histogram.

In [None]:
# special IPython command to prepare the notebook for matplotlib
%matplotlib inline

import matplotlib.pyplot as plt

df_sharewomen = df['ShareWomen']

plt.hist(df_sharewomen, bins=10)
plt.title('Distribution of Ratio of Female Students')
plt.xlabel('Ratio')
plt.ylabel('Number of Majors')
plt.show()
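
To see the numbers behind a histogram, `np.histogram` computes the bin edges and per-bin counts without drawing anything. The values below are made up for illustration. Note that `np.histogram` raises an error when the input contains NaN, so on real data a call such as `df['ShareWomen'].dropna()` may be needed first (that depends on the data file).

```python
import numpy as np

# Made-up ratios between 0 and 1 (not taken from the dataset)
shares = np.array([0.12, 0.25, 0.33, 0.41, 0.48, 0.52, 0.67, 0.71, 0.85, 0.93])

# 10 equal-width bins spanning the min and max of the data
counts, bin_edges = np.histogram(shares, bins=10)

for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print("[{0:0.3f}, {1:0.3f}): {2}".format(left, right, count))
```

With `bins=10` there are 11 edges, and every value falls into exactly one bin (the last bin also includes its right edge).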

### Scatter Plot
Let's check whether female students tend to be concentrated in majors with fewer students.

In [None]:
plt.scatter(df_total, df_sharewomen, c="g", alpha=0.5)
plt.title('Total Number of Students vs. Ratio of Female Students')
plt.xlabel('Total Number of Students')
plt.ylabel('Ratio of Female Students')
plt.show()
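
A scatter plot gives a visual impression; a correlation coefficient puts a number on it. The sketch below computes the Pearson correlation with `Series.corr`, plus a rank-based (Spearman-style) correlation obtained by correlating the ranks, which is less sensitive to a few very large majors. The five rows are made up for illustration and say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical values, for illustration only
toy = pd.DataFrame({
    "Total": [200, 1500, 3000, 12000, 40000],
    "ShareWomen": [0.70, 0.55, 0.62, 0.40, 0.35],
})

# Pearson correlation: strength of the linear relationship
pearson_r = toy["Total"].corr(toy["ShareWomen"])

# Rank correlation: Pearson correlation computed on the ranks
spearman_r = toy["Total"].rank().corr(toy["ShareWomen"].rank())

print("Pearson r = {0:0.2f}, rank correlation = {1:0.2f}".format(pearson_r, spearman_r))
```

A value near -1 or 1 indicates a strong monotonic relationship, while a value near 0 indicates little or none.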

### Box-Plot
Let's produce a box-plot of the unemployment rate across different major categories.

In [None]:
import seaborn as sns
# generating a boxplot for unemployment rate across different major categories
ax = sns.boxplot(x="Major_category", y="Unemployment_rate", data=df)
plt.xlabel('Major Category')
plt.ylabel('Unemployment Rate')
plt.title('Unemployment Rate for Various Major Categories')
plt.xticks(rotation=60)
plt.show()

From the box-plot generated above, majors related to Business appear to have lower unemployment rates compared with other majors.
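
A reading taken from a box-plot can be checked numerically by grouping on the category and summarizing each group. The sketch below does this with `groupby` and `agg` on a hypothetical DataFrame; the category names and rates are placeholders, not values from the dataset.

```python
import pandas as pd

# Hypothetical categories and rates, for illustration only
toy = pd.DataFrame({
    "Major_category": ["Category X", "Category X",
                       "Category Y", "Category Y", "Category Y"],
    "Unemployment_rate": [0.05, 0.07, 0.09, 0.04, 0.10],
})

# per-category summary, sorted so the lowest median comes first
summary = (
    toy.groupby("Major_category")["Unemployment_rate"]
       .agg(["median", "min", "max", "count"])
       .sort_values("median")
)
print(summary)
```

Applying the same grouping to `df` would list the major categories in order of median unemployment rate, which makes comparisons between categories explicit.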