# Data Cleaning and Summary Statistics

In [1]:
%autosave 0

Autosave disabled


In [2]:
# print all the outputs in a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import pandas as pd

Load the data of the survey taken by Data Science students.

In [3]:
df  = pd.read_csv('data science survey.csv')

In [4]:
# A relative path can also be used to load the data file

#df  = pd.read_csv('data/data science survey.csv')

In [6]:
df.head(1)

Unnamed: 0,Timestamp,Do you have a job?,How long ago did you get your Bachelor degree?,What program are enrolled in?,How would you rate your computer programming background?,Have you ever programmed in C?,Have you ever programmed in C++?,Have you ever programmed in C#?,Have you ever programmed in Java?,Have you ever programmed in Python?,Have you ever programmed in Javascript?,Have you ever programmed in R?,Have you ever programmed in SQL?,Have you ever used SAS?,Have you ever used Excel?,Have you ever used Tableau?,Have you ever run a regression?,"How familiar are you with the Machine Learning task called ""classification""?","How familiar are you with the Machine Learning task called ""clustering""?"
0,2017/01/09 2:48:20 PM MST,"No, I'm not working at the moment",longer than 1 year ago but less than 3 years ago,MSIS,4,Yes,Yes,No,Yes,Yes,Yes,No,Yes,No,Yes,No,Yes,4,4


We don't like this data set for a variety of reasons:
<ul>
<li>Column headers are too long
<li>Some cell values are too long
<li>Some cell values are yes/no, but we prefer 1/0
</ul>

## Cleaning

### Let's change some of the column names

In [5]:
df.columns = ['Timestamp',
             'Job',
             'BachTime',
             'Program',
             'ProgSkills',
             'C',
             'CPP',
             'CS',
             'Java',
             'Python',
             'JS',
             'R',
             'SQL',
             'SAS',
             'Excel',
             'Tableau',
             'Regression',
             'Classification',
             'Clustering']

Now the column names look a lot better!


### Let's remove timestamps

Suppose that we don't need the timestamps. Here is how to remove a column
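A minimal sketch of dropping a column. The two-row DataFrame is a made-up stand-in for the survey data, not the real file:

```python
import pandas as pd

# Toy stand-in for the survey DataFrame (values are made up)
df = pd.DataFrame({
    'Timestamp': ['2017/01/09 2:48:20 PM MST', '2017/01/09 3:02:11 PM MST'],
    'Job': ["No, I'm not working at the moment", 'Yes, I have a full-time job'],
})

# drop() with axis=1 returns a new DataFrame without the named column
df = df.drop('Timestamp', axis=1)
print(df.columns.tolist())
```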

### Replace job with 0 (no job), 0.5 (part time), and 1 (full time)

<p>We want to replace the values of the column "Job", as follows:</p>
<p>
<ul>
<li><i>No, I'm not working at the moment</i> --> 0
<li><i>Yes, I have a part-time job</i> --> 0.5
<li><i>Yes, I have a full-time job</i> --> 1
</ul>
</p>
<p>
We will show three alternative solutions (1, 2, and 3) to perform this task. They will result in the creation of three columns (Job1, Job2, and Job3). At the end, we will delete the original column Job, we will delete two of these columns, and we will rename the remaining column "Job".
</p>

#### Solution 1 (column Job1)

Create a column 'Job1' through <i>df.loc</i>.
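One way this could look, sketched on a made-up three-row DataFrame: each <i>df.loc</i> call selects the rows with a given answer and writes the numeric code into the new column.

```python
import pandas as pd

# Toy stand-in for the survey DataFrame (one row per possible answer)
df = pd.DataFrame({'Job': ["No, I'm not working at the moment",
                           'Yes, I have a part-time job',
                           'Yes, I have a full-time job']})

# df.loc[row_mask, column] = value assigns only to the rows where the mask is True
df.loc[df['Job'] == "No, I'm not working at the moment", 'Job1'] = 0
df.loc[df['Job'] == 'Yes, I have a part-time job', 'Job1'] = 0.5
df.loc[df['Job'] == 'Yes, I have a full-time job', 'Job1'] = 1
print(df)
```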

#### Solution 2 (column Job2)

Here, we will use the function <b>apply</b> on the column <i>Job</i>. The function "apply" requires as input a function that specifies how to transform each value. The function should perform the following transformations:
<ul>
<li>No, I'm not working at the moment => 0</li>
<li>Yes, I have a part-time job => 0.5</li>
<li>Yes, I have a full-time job => 1</li>
</ul>
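A possible sketch of this approach on made-up data. The function name <i>Job2Num</i> follows the name used in this notebook, but the body below is a reconstruction and the handling of unexpected answers is an assumption:

```python
import pandas as pd

# Toy stand-in for the survey DataFrame
df = pd.DataFrame({'Job': ["No, I'm not working at the moment",
                           'Yes, I have a part-time job',
                           'Yes, I have a full-time job']})

def Job2Num(answer):
    """Translate one survey answer into its numeric code."""
    if answer == "No, I'm not working at the moment":
        return 0
    elif answer == 'Yes, I have a part-time job':
        return 0.5
    elif answer == 'Yes, I have a full-time job':
        return 1
    return None  # assumption: unexpected answers become missing values

# Series.apply calls Job2Num once for every value of the column
df['Job2'] = df['Job'].apply(Job2Num)
print(df)
```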

#### Another version of Job2Num
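One plausible alternative, offered as a sketch rather than the original code, is to replace the chain of conditions with a dictionary lookup. The name <i>JOB_CODES</i> is hypothetical:

```python
import pandas as pd

# Hypothetical lookup table from survey answer to numeric code
JOB_CODES = {
    "No, I'm not working at the moment": 0,
    'Yes, I have a part-time job': 0.5,
    'Yes, I have a full-time job': 1,
}

def Job2Num(answer):
    # dict.get returns None when the answer is not in the table
    return JOB_CODES.get(answer)

# Toy stand-in for the survey DataFrame
df = pd.DataFrame({'Job': list(JOB_CODES)})
df['Job2'] = df['Job'].apply(Job2Num)
print(df)
```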

#### Solution 3 (column Job3)

Instead of declaring a function as above, we can pass a lambda (or anonymous) function
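A sketch of the lambda variant on made-up data. The two constants are introduced here only to keep the line short:

```python
import pandas as pd

NO_JOB = "No, I'm not working at the moment"
PART_TIME = 'Yes, I have a part-time job'

# Toy stand-in for the survey DataFrame
df = pd.DataFrame({'Job': [NO_JOB, PART_TIME, 'Yes, I have a full-time job']})

# The lambda is defined inline; any answer that is neither NO_JOB nor PART_TIME maps to 1
df['Job3'] = df['Job'].apply(lambda a: 0 if a == NO_JOB else (0.5 if a == PART_TIME else 1))
print(df)
```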

#### Finalize

The DataFrame now has the original column <i>Job</i> and three identical columns <i>Job1</i>, <i>Job2</i>, and <i>Job3</i>. We delete the original column <i>Job</i>, as well as <i>Job1</i> and <i>Job2</i>, and then we rename the remaining column from <i>Job3</i> to <i>Job</i>.
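These two steps might be written as follows (the DataFrame is a made-up stand-in that already contains the three numeric columns):

```python
import pandas as pd

# Toy DataFrame in the state described above
df = pd.DataFrame({
    'Job': ["No, I'm not working at the moment", 'Yes, I have a full-time job'],
    'Job1': [0.0, 1.0],
    'Job2': [0.0, 1.0],
    'Job3': [0.0, 1.0],
})

# Delete the original text column and two of the three numeric copies
df = df.drop(['Job', 'Job1', 'Job2'], axis=1)

# Rename the surviving column back to 'Job'
df = df.rename(columns={'Job3': 'Job'})
print(df)
```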

### Replace <i>BachTime</i> with a dummy variable

The time from graduation <i>BachTime</i> can have the following values:
<ul>
<li>less than 1 year ago</li>
<li>longer than 1 year ago but less than 3 years ago</li>
<li>longer than 3 years ago but less than 5 years ago</li>
<li>over 5 years ago</li>
</ul>

We want to convert the column BachTime to <i>dummy variables</i>, that is, we want to create four columns ('Bach_0to1', 'Bach_1to3', 'Bach_3to5', 'Bach_5Plus') of which exactly one will be 1 and the others 0 in each row. These columns will indicate how long ago the student received the bachelor's degree.

The method <i>get_dummies</i> performs precisely this task. It transforms a column whose values come from a set $V$ of unique values into $|V|$ new columns (one for each value $v \in V$).

Let's create two dummy dataframes to try out two different methods.
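A sketch of this step on made-up data. The names <i>dumDF1</i> and <i>dumDF</i> follow the text below; calling <i>get_dummies</i> on the single column and passing <i>dtype=int</i> (to get 0/1 instead of booleans on recent pandas versions) are assumptions:

```python
import pandas as pd

# Toy stand-in: one row per possible BachTime answer
df = pd.DataFrame({'BachTime': [
    'less than 1 year ago',
    'longer than 1 year ago but less than 3 years ago',
    'longer than 3 years ago but less than 5 years ago',
    'over 5 years ago',
]})

# One 0/1 column per unique value; the column names are the cell contents
dumDF1 = pd.get_dummies(df['BachTime'], dtype=int)
dumDF = pd.get_dummies(df['BachTime'], dtype=int)
print(dumDF1.columns.tolist())
```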

The column names generated by the method <i>get_dummies</i> are very wordy because they use the original cell contents. So, let us rename the columns. We can use two different methods:

### Option 1:
<b>We are using dumDF1</b>

We can do it column by column with rename():
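For instance, assuming the dummy columns carry the raw answers as names (the data below is made up):

```python
import pandas as pd

bach = pd.Series([
    'less than 1 year ago',
    'longer than 1 year ago but less than 3 years ago',
    'longer than 3 years ago but less than 5 years ago',
    'over 5 years ago',
], name='BachTime')
dumDF1 = pd.get_dummies(bach, dtype=int)

# rename() takes a mapping {old name: new name}; here one column per call
dumDF1 = dumDF1.rename(columns={'less than 1 year ago': 'Bach_0to1'})
dumDF1 = dumDF1.rename(columns={'longer than 1 year ago but less than 3 years ago': 'Bach_1to3'})
dumDF1 = dumDF1.rename(columns={'longer than 3 years ago but less than 5 years ago': 'Bach_3to5'})
dumDF1 = dumDF1.rename(columns={'over 5 years ago': 'Bach_5Plus'})
print(dumDF1.columns.tolist())
```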

### Option 2:

<b>We are using dumDF</b>
1. transform the columns into a list
2. modify the list
3. set the new columns vector all at once
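The three steps can be sketched as follows on made-up data. Note that this relies on the column order produced by <i>get_dummies</i>, which sorts the unique values:

```python
import pandas as pd

bach = pd.Series([
    'less than 1 year ago',
    'longer than 1 year ago but less than 3 years ago',
    'longer than 3 years ago but less than 5 years ago',
    'over 5 years ago',
], name='BachTime')
dumDF = pd.get_dummies(bach, dtype=int)

# 1. transform the columns into a list
cols = dumDF.columns.tolist()

# 2. modify the list (positions follow the sorted order of the original values)
cols[0] = 'Bach_0to1'
cols[1] = 'Bach_1to3'
cols[2] = 'Bach_3to5'
cols[3] = 'Bach_5Plus'

# 3. set the new columns vector all at once
dumDF.columns = cols
print(dumDF.columns.tolist())
```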

Overwrite our original dataframe df
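One possible reading of this step, stated as an assumption, is to attach the renamed dummy columns to <i>df</i> and drop the text column they replace. A sketch on made-up data:

```python
import pandas as pd

# Toy stand-in for the survey DataFrame
df = pd.DataFrame({'Program': ['MSIS', 'MBA'],
                   'BachTime': ['less than 1 year ago', 'over 5 years ago']})

dumDF = pd.get_dummies(df['BachTime'], dtype=int)
dumDF.columns = ['Bach_0to1', 'Bach_5Plus']  # only two categories occur in this toy data

# Assumption: replace BachTime with its dummy columns, side by side (axis=1)
df = pd.concat([df.drop('BachTime', axis=1), dumDF], axis=1)
print(df)
```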

### Replace Yes with 1 and No with 0

Let us replace everywhere "Yes" with 1 and "No" with 0
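A sketch using <i>DataFrame.replace</i> on made-up data. Only cells that are exactly "Yes" or "No" are changed; some pandas versions may emit a warning about dtype downcasting here:

```python
import pandas as pd

# Toy stand-in for the survey DataFrame
df = pd.DataFrame({'C': ['Yes', 'No'],
                   'Python': ['Yes', 'Yes'],
                   'Program': ['MSIS', 'MBA']})

# The mapping is applied to every cell of the DataFrame
df = df.replace({'Yes': 1, 'No': 0})
print(df)
```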

### Adding simple columns

Let's add a column that counts how many programming languages students know. Languages = C+CPP+CS+Java+Python+JS+R+SQL+SAS

To rearrange column order:<br/>
<b>reindex(columns=[the columns in the order that you want])</b>

Let's place <i>Languages</i> first
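A sketch of the reordering on a made-up DataFrame, building the new order from the existing column list:

```python
import pandas as pd

# Toy stand-in for the survey DataFrame
df = pd.DataFrame({'Job': [0.0, 1.0], 'Program': ['MSIS', 'MBA'], 'Languages': [5, 3]})

# 'Languages' first, then every other column in its current order
new_order = ['Languages'] + [c for c in df.columns if c != 'Languages']
df = df.reindex(columns=new_order)
print(df.columns.tolist())
```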

### Adding complex columns (advanced topic)

Let's add a 0-1 column called "Expert" that is 1 if <i>Languages</i> is 3 or more and 0 otherwise.

First of all, note that you can do it very easily with what you already know:
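For example, with <i>df.loc</i> as in Solution 1 for the Job column (a sketch on made-up data; the original cell may have used a different approach):

```python
import pandas as pd

# Toy stand-in: number of languages known by three students
df = pd.DataFrame({'Languages': [5, 2, 3]})

# Start with 0 everywhere, then set 1 on the rows that satisfy the condition
df['Expert'] = 0
df.loc[df['Languages'] >= 3, 'Expert'] = 1
print(df)
```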

But suppose that the calculation is complicated. First, we need to define a function that, given a row (i.e., a Series that represents a student), returns 1 if the student is an expert (and 0 otherwise). This row has index labels 'Job', 'BachTime', 'Program', etc.

Second, we use the function <b>apply</b> with axis=1, which calls the function once per row and returns a Series:
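Both steps might look like this on made-up data. The function name <i>is_expert</i> is hypothetical:

```python
import pandas as pd

# Toy stand-in for the survey DataFrame
df = pd.DataFrame({'Program': ['MSIS', 'MBA', 'MSIS'], 'Languages': [5, 2, 3]})

def is_expert(row):
    """row is a Series whose index labels are the column names."""
    if row['Languages'] >= 3:
        return 1
    return 0

# axis=1 passes each row to is_expert; the results form a new Series
df['Expert'] = df.apply(is_expert, axis=1)
print(df)
```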

Alternatively, instead of defining a function with <i>def</i>, we can use a lambda function, which allows us to define the function "inline"
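A sketch of the inline version on made-up data:

```python
import pandas as pd

# Toy stand-in for the survey DataFrame
df = pd.DataFrame({'Program': ['MSIS', 'MBA', 'MSIS'], 'Languages': [5, 2, 3]})

# Same row-wise logic, written as an anonymous function
df['Expert'] = df.apply(lambda row: 1 if row['Languages'] >= 3 else 0, axis=1)
print(df)
```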

<b>Warning</b>: <i>DataFrame.apply</i> is slow, especially when called on all rows (i.e., axis=1). If you can, you should use operations among Series and scalars. The performance on the example above can be improved a lot as follows:
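A sketch of a vectorized version on made-up data: the comparison is evaluated on the whole column at once, and the resulting booleans are converted to 0/1.

```python
import pandas as pd

# Toy stand-in for the survey DataFrame
df = pd.DataFrame({'Languages': [5, 2, 3]})

# Series >= scalar yields a boolean Series; astype(int) turns True/False into 1/0
df['Expert'] = (df['Languages'] >= 3).astype(int)
print(df)
```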

### DataFrame.apply vs Series.apply

In pandas 0.20, DataFrame.apply is slow whereas Series.apply is fast. For example, say that we want to add a column <i>ProgramLower</i> with the lower case value of Program.

#### Solution 1: DataFrame.apply (slow! Do not use!)
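A sketch of the row-wise version on made-up data, shown only for comparison:

```python
import pandas as pd

# Toy stand-in for the survey DataFrame
df = pd.DataFrame({'Program': ['MSIS', 'MBA'], 'Languages': [5, 2]})

# Every row is materialized as a Series just to read one field from it
df['ProgramLower'] = df.apply(lambda row: row['Program'].lower(), axis=1)
print(df)
```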

#### Solution 2: Series.apply (faster: no Series is built for every row)
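A sketch of the column-wise version on made-up data. The function receives each string value directly:

```python
import pandas as pd

# Toy stand-in for the survey DataFrame
df = pd.DataFrame({'Program': ['MSIS', 'MBA'], 'Languages': [5, 2]})

# Series.apply works on the values of one column only
df['ProgramLower'] = df['Program'].apply(lambda s: s.lower())
print(df)
```

For this particular transformation, the string accessor `df['Program'].str.lower()` gives the same result without an explicit function.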

### Write it to a file
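A minimal sketch with <i>to_csv</i>. The file name <i>survey_clean.csv</i> and the temporary directory are hypothetical choices, and <i>index=False</i> (which omits the row index from the file) is an assumption about the desired output:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the cleaned survey DataFrame
df = pd.DataFrame({'Languages': [5, 3], 'Job': [0.0, 1.0]})

# Hypothetical output location
out_path = Path(tempfile.gettempdir()) / 'survey_clean.csv'
df.to_csv(out_path, index=False)

# Read the file back to check what was written
check = pd.read_csv(out_path)
print(check)
```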

## Summary Functions

pandas provides many simple "summary functions" which restructure the data in some useful way. 

### .describe()

This method generates a high-level summary of the attributes of the given column. It is type-aware, meaning that its output changes based on the dtype of the input. For numerical data it reports the count, mean, standard deviation, minimum, quartiles, and maximum; for string data it reports the count, the number of unique values, the most frequent value, and its frequency.
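A sketch contrasting the two cases on a made-up DataFrame:

```python
import pandas as pd

# Toy stand-in for the survey DataFrame
df = pd.DataFrame({'ProgSkills': [4, 2, 3, 5],
                   'Program': ['MSIS', 'MBA', 'MSIS', 'MSIS']})

desc_num = df['ProgSkills'].describe()  # numerical column
desc_str = df['Program'].describe()     # string column
print(desc_num)
print(desc_str)
```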

To see a list of unique values we can use the **unique** function:
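For example, on a made-up column:

```python
import pandas as pd

# Toy stand-in for the survey DataFrame
df = pd.DataFrame({'Program': ['MSIS', 'MBA', 'MSIS', 'MSIS']})

# unique() returns the distinct values in order of first appearance
programs = df['Program'].unique()
print(programs)
```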

To see a list of unique values and how often they occur in the dataset, we can use the **value_counts** method:
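For example, on a made-up column:

```python
import pandas as pd

# Toy stand-in for the survey DataFrame
df = pd.DataFrame({'Program': ['MSIS', 'MBA', 'MSIS', 'MSIS']})

# value_counts() returns a Series indexed by value, sorted by decreasing frequency
counts = df['Program'].value_counts()
print(counts)
```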

versus **.count()**, which returns only the number of non-missing values:
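A short sketch of the difference, on a made-up column with one missing entry:

```python
import pandas as pd

s = pd.Series(['MSIS', 'MBA', None, 'MSIS'], name='Program')

n_present = s.count()          # a single number: entries that are not missing
per_value = s.value_counts()   # one count per distinct value
print(n_present)
print(per_value)
```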

**.shape** returns a tuple (number of rows, number of columns) representing the dimensionality of the DataFrame.
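For example, on a made-up DataFrame with four rows and two columns:

```python
import pandas as pd

df = pd.DataFrame({'ProgSkills': [4, 2, 3, 5],
                   'Program': ['MSIS', 'MBA', 'MSIS', 'MSIS']})

# shape is an attribute, not a method, so there are no parentheses
n_rows, n_cols = df.shape
print(df.shape)
```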

## Problems

How many students know SQL?

What is the average programming skill of MSIS students? Compare it to that of MBA students.

How many students know classification better than clustering? And how many clustering better than classification?

## Some Analysis

Summary statistics

Correlation
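A sketch of both steps on made-up data. The name <i>cor</i> follows the text of the next section; restricting the correlation to numeric columns with <i>select_dtypes</i> is an assumption:

```python
import pandas as pd

# Toy stand-in for the cleaned survey DataFrame
df = pd.DataFrame({'ProgSkills': [4, 2, 3, 5, 1],
                   'Python': [1, 0, 1, 1, 0],
                   'Excel': [1, 1, 0, 1, 0],
                   'Program': ['MSIS', 'MBA', 'MSIS', 'MSIS', 'MBA']})

# Summary statistics of the numeric columns
summary = df.describe()

# Pairwise Pearson correlations between the numeric columns
cor = df.select_dtypes('number').corr()
print(summary)
print(cor)
```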

## Find the strongest correlations

What are the top 10 correlations and the top 10 anti-correlations? 

To answer this question, we need to "stack" the correlation matrix with <i>cor.stack()</i>, which means turning it into a Series with a "Hierarchical index" (that is, an index with two levels).
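A sketch of stacking on a made-up correlation matrix. The name <i>stacked</i> is hypothetical:

```python
import pandas as pd

# Toy stand-in for the cleaned survey DataFrame
df = pd.DataFrame({'ProgSkills': [4, 2, 3, 5, 1],
                   'Python': [1, 0, 1, 1, 0],
                   'Excel': [1, 1, 0, 1, 0]})
cor = df.corr()

# Each (row label, column label) pair of the matrix becomes one entry of the Series
stacked = cor.stack()
print(stacked)
```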

Remove the correlations equal to 1 (they are self correlations); then, pick one correlation every two (as they all appear twice)

Use list slicing: x[startAt:endBefore:skip]
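A sketch of these steps on made-up data. As a substitution for testing values against 1, it drops self-pairs by comparing the two index levels, which avoids floating-point equality and keeps any genuine perfect correlation between different columns. The names <i>off_diag</i> and <i>top</i> are hypothetical:

```python
import pandas as pd

# Toy stand-in for the cleaned survey DataFrame
df = pd.DataFrame({'ProgSkills': [4, 2, 3, 5, 1],
                   'Python': [1, 0, 1, 1, 0],
                   'Excel': [1, 1, 0, 1, 0]})
stacked = df.corr().stack()

# Remove self correlations: entries whose two index labels are the same
first = stacked.index.get_level_values(0)
second = stacked.index.get_level_values(1)
off_diag = stacked[first != second]

# Sort from strongest to weakest; (A, B) and (B, A) end up next to each other
ordered = off_diag.sort_values(ascending=False)

# x[startAt:endBefore:skip]: the first 20 entries, one every two, i.e. up to 10 pairs
top = ordered.iloc[0:20:2]
print(top)
```

With only three columns there are just three distinct pairs, so the slice returns three entries here.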

Do the same for negative correlations
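The same idea with ascending order, sketched on made-up data in which one column is negatively related to the others (the name <i>bottom</i> is hypothetical):

```python
import pandas as pd

# Toy data: Excel is constructed to move against ProgSkills and Python
df = pd.DataFrame({'ProgSkills': [4, 2, 3, 5, 1],
                   'Python': [1, 0, 1, 1, 0],
                   'Excel': [0, 1, 1, 0, 1]})
stacked = df.corr().stack()

# Drop self-pairs by comparing the two index levels
first = stacked.index.get_level_values(0)
second = stacked.index.get_level_values(1)
off_diag = stacked[first != second]

# Ascending order puts the most negative correlations first
bottom = off_diag.sort_values(ascending=True).iloc[0:20:2]
print(bottom)
```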

Least significant correlations
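Assuming "least significant" means closest to zero, a sketch on made-up data sorts by absolute value instead (the name <i>weakest</i> is hypothetical, and the <i>key</i> argument of <i>sort_values</i> requires a reasonably recent pandas):

```python
import pandas as pd

# Toy stand-in for the cleaned survey DataFrame
df = pd.DataFrame({'ProgSkills': [4, 2, 3, 5, 1],
                   'Python': [1, 0, 1, 1, 0],
                   'Excel': [1, 1, 0, 1, 0]})
stacked = df.corr().stack()

# Drop self-pairs by comparing the two index levels
first = stacked.index.get_level_values(0)
second = stacked.index.get_level_values(1)
off_diag = stacked[first != second]

# Sort by magnitude so that correlations near zero come first, then skip duplicates
weakest = off_diag.sort_values(key=lambda s: s.abs()).iloc[0:20:2]
print(weakest)
```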
