<p style="text-align:center">
PSY 394U <b>Data Analytics with Python</b>, Spring 2018


<img style="width: 400px; padding: 0px;" src="https://github.com/sathayas/JupyterAnalyticsSpring2018/blob/master/images/Title_pics.png?raw=true" alt="title pics"/>

</p>

<p style="text-align:center; font-size:40px; margin-bottom: 30px;"><b>Overview, Tools, & Python refresher</b></p>

<p style="text-align:center; font-size:18px; margin-bottom: 32px;"><b>January 18, 2018</b></p>

<hr style="height:5px;border:none" />

# 1. Overview
<hr style="height:1px;border:none" />

We will be dealing with non-traditional data analysis techniques and data throughout the semester. Here, non-traditional is in contrast to traditional statistical approaches of inference and estimation. 

## Machine learning
The goal is to ***learn*** from the data in an attempt to predict. This data-driven approach is in contrast to a traditional statistical approach in which data are assumed to originate from a certain model and/or distribution. We will use **`Scikit-learn`** to implement some algorithms. 

### Unsupervised learning
You only have a collection of observations as your data, without any labels. The goal is to categorize them into groups solely based on what is available from the data.

<img style="width: 300px; padding: 0px;" src="https://github.com/sathayas/JupyterAnalyticsSpring2018/blob/master/images/Intro_UnsupervisedEg.png?raw=true" alt="Clustering example"/>
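
As a rough illustration of the idea, the sketch below groups six made-up 2-D points into two clusters with k-means from Scikit-learn. The points, the choice of k-means, and the number of clusters are assumptions made for this example only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Six made-up observations: three near (1, 1) and three near (5, 5)
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
              [5.0, 5.1], [5.1, 4.9], [4.9, 5.0]])

# Ask for two clusters; no labels are given to the algorithm
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)

# The cluster numbers themselves (0 or 1) are arbitrary;
# what matters is which points share a number
print(labels)
```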

### Supervised learning
You have a collection of observations as your data, along with some outcome measures. Your goal is to come up with an algorithm to predict the outcome for a new observation. 

<img style="width: 300px; padding: 0px;" src="https://github.com/sathayas/JupyterAnalyticsSpring2018/blob/master/images/Intro_SupervisedEg.png?raw=true" alt="Support vector machine example"/>
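
A minimal sketch of this workflow, assuming a linear support vector machine from Scikit-learn and a handful of made-up points with made-up outcomes (0 or 1):

```python
import numpy as np
from sklearn.svm import SVC

# Made-up observations and their known outcomes
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
              [5.0, 5.1], [5.1, 4.9], [4.9, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Learn a rule from the labeled observations
clf = SVC(kernel="linear")
clf.fit(X, y)

# Predict the outcome for observations the classifier has not seen
newPoints = np.array([[1.2, 0.8], [4.8, 5.2]])
pred = clf.predict(newPoints)
print(pred)
```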


## Network analysis
Here, your data set describes relationships (referred to as edges) among individual units (e.g., people, servers, brain areas, etc.; referred to as nodes). There are many things you can learn from such data. We will use functions from **`NetworkX`** to analyze network data.

<img style="width: 300px; padding: 0px;" src="https://github.com/sathayas/JupyterAnalyticsSpring2018/blob/master/images/Intro_Network.png?raw=true" alt="Les Mis network"/>

<p style="text-align:center; font-size:10px;">Source: http://web.madstudio.northwestern.edu/re-visualizing-the-novel/</p>
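
To make nodes and edges concrete, here is a small sketch with NetworkX. The four people and their friendships are hypothetical and serve only as an illustration.

```python
import networkx as nx

# Each pair is an edge (a relationship) between two nodes (people)
G = nx.Graph()
G.add_edges_from([("Ann", "Bob"), ("Ann", "Cat"),
                  ("Bob", "Cat"), ("Cat", "Dan")])

print(G.number_of_nodes(), 'nodes,', G.number_of_edges(), 'edges')

# Degree: the number of edges attached to each node
degrees = dict(G.degree())
print(degrees)
```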


## Text mining
Text mining is a branch of natural language processing that extracts information from text data. We will use tools available in **`NLTK`** to extract data from texts.

<img style="width: 250px; padding: 0px;" src="https://github.com/sathayas/JupyterAnalyticsSpring2018/blob/master/images/Intro_WordTally.png?raw=true" alt="Word tally chart"/>
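
A word tally like the one pictured can be sketched in a few lines. This example uses only Python's standard library rather than NLTK, and the sentence is made up; NLTK offers more careful tokenization than the simple `split()` used here.

```python
from collections import Counter

# A made-up sentence standing in for a real text
text = "the cat sat on the mat and the dog sat too"

# Crude tokenization: lower-case, then split on whitespace
words = text.lower().split()

# Count how many times each word occurs
tally = Counter(words)
topTwo = tally.most_common(2)
print(topTwo)
```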






# 2. Git
<hr style="height:1px;border:none" />

Example codes, data, and notes associated with this course will be available from **GitHub** repositories. You can view individual files on GitHub via the links provided. However, since there will be a large number of files associated with this course, it is a good idea to *clone* the repositories (i.e., make local copies of them) using **Git**. Once you clone a repository, it is very easy to update your local copy with changes and additional files as I update the GitHub repository throughout the semester. 

### Cloning a repository
In order to clone a GitHub repository, *you need to have Git installed on your computer*. The installation instructions are available on Canvas. To clone a repository, run the following command in the Terminal app on Mac or in Git Bash on Windows:
```
git clone [URL] [Directory Name]
```
where `[URL]` is the location of a GitHub repository and `[Directory Name]` is the name of the directory where the contents of the cloned repository will reside. I highly recommend cloning two repositories: one for codes and data, the other for notes. Here are their URLs:

  * Codes & data:   https://github.com/sathayas/AnalyticsClassSpring2018
  * Notes:          https://github.com/sathayas/JupyterAnalyticsSpring2018

You should create two separate directories for these.


### Updating a repository
Once you clone a repository, you can update your local copy whenever I make a change to the original GitHub repository (e.g., editing some codes, adding new codes or data, etc.). To do so, go to the directory where you cloned the GitHub repository, and run the following command:
```
git pull
```
This should bring any changes and additions from the GitHub repository into your local repository. 


### Overwriting local changes and updating

Suppose you modify some codes locally on your computer, and I also modify the same codes in my GitHub repository. If you simply run `git pull`, you will likely get an error message, and your codes will not be synced with the GitHub version. One way around this is to discard any local changes you have made before running `git pull`. You need to run
```
git reset --hard
git pull
```
Note that if you do this, any uncommitted local changes to files tracked by Git will be lost.


### More on Git

If you are interested in learning more about how to use **Git**, you can refer to my tutorial note on Git and GitHub at

      https://github.com/sathayas/JupyterPythonFall2017/blob/master/Git.ipynb

There are many other tricks associated with Git (e.g., forking). I will leave that to Git's documentation at 

      https://git-scm.com/doc

# 3. Jupyter notebook
<hr style="height:1px;border:none" />

## What is Jupyter notebook?

A Jupyter notebook document is an interactive document that lets you run Python right there in your document. It is an ideal setting for teaching and learning how to code. You may have noticed that course notes posted on Canvas are links to Jupyter notebook documents (**`.ipynb`**) stored on GitHub. You can still view my notes in your web browser, but if you open a note in Jupyter notebook, then you can actually run the code snippets on your computer.

To use Jupyter notebook, you need to install Jupyter notebook on your computer, in addition to Python. The installation instructions can be found in the installation documentation available on Canvas.

## How to work with a Jupyter notebook document

### Starting up
To start Jupyter notebook, open the Terminal (Mac) or Command Prompt (Windows) app. Then type in the command:

```
jupyter notebook
```

This should open up a web browser, and it should look like this:

<img src="https://github.com/sathayas/JupyterPythonFall2017/blob/master/images/Jupyter_StartUp.png?raw=true" alt="Starting up Jupyter notebook" style="width: 500px;"/>

Using the file browser within your web browser, find a Jupyter notebook document. It should look something like this:

<img src="https://github.com/sathayas/JupyterPythonFall2017/blob/master/images/Jupyter_Notebook.png?raw=true" alt="An example of Jupyter notebook" style="width: 500px;"/>

### Code cells
Within a notebook document, some codes can be executed on the spot. Simply find a code cell, a box with **`In [ ]:`** on the left margin.

<img src="https://github.com/sathayas/JupyterPythonFall2017/blob/master/images/Jupyter_CodeCell.png?raw=true" alt="Code cell" style="width: 800px;"/>

You can click on this **cell** to select it. If you happen to click on the text area, then you will see the cell selected with a *green* border.

<img src="https://github.com/sathayas/JupyterPythonFall2017/blob/master/images/Jupyter_CodeCellGreen.png?raw=true" alt="Selected code cell (green)" width="800">

Or, if you happen to click anywhere else on the cell, then you will see the cell with a *blue* border.

<img src="https://github.com/sathayas/JupyterPythonFall2017/blob/master/images/Jupyter_CodeCellBlue.png?raw=true" alt="Selected code cell (blue)" width="800">

### Running a code cell
Once the cell is selected (blue or green), you can run that code by pressing the run button 

<img src="https://github.com/sathayas/JupyterPythonFall2017/blob/master/images/Jupyter_PlayButton.png?raw=true" alt="Run button" width="30">

within the browser window on the top. Then the code is executed, and the output is produced. You can see the output immediately below the code cell. 

<img 
src="https://github.com/sathayas/JupyterPythonFall2017/blob/master/images/Jupyter_Output.png?raw=true" alt="Output example" width="800">

Try it for yourself!

In [None]:
print("Hello World!")
yourname = '???????'  # Replace with your name
print("Hello, " + yourname + "!!")

### Editing and re-running a code cell
You can edit and re-run the code within the notebook as well. Just click on the text of the code cell (so that it's green) and make an edit. 

<img 
src="https://github.com/sathayas/JupyterPythonFall2017/blob/master/images/Jupyter_EditCell.png?raw=true" alt="Edit the code cell" width="800">

Here, I am changing the variable **`yourname`** from **`Satoru`** to **`Hayasaka`**. Then I can just re-run the code by clicking on the run button again.

<img 
src="https://github.com/sathayas/JupyterPythonFall2017/blob/master/images/Jupyter_RerunCell.png?raw=true" alt="Output from a modified code" width="800">

Jupyter notebook by default auto-saves any edits frequently. You can also click on the Save button to save any changes made to your Jupyter notebook.

### Closing a Jupyter notebook
When you are done with a Jupyter notebook, there are multiple ways to **close** it. This is the way I do it:

  1. Go to Terminal or Command Prompt, and press CTRL-C.
  2. When asked whether to shut down the notebook server, answer **`y`**.
  3. You should see a message saying that the kernel is dead. You just click on **Don't Restart** at that point. 
  4. Close the browser window associated with the Jupyter notebook.
  5. If you do not see a prompt in the Terminal or Command Prompt window, press CTRL-C one more time.

# 4. Python refresher
<hr style="height:1px;border:none" />

*For those using Jupyter Notebook, please run this line of code so that any plots generated during the class will be visible in your notebook document.*

In [None]:
%matplotlib inline

## Reading a data frame
The file **`co-est2016-alldata.csv`**, available on GitHub, contains annual population estimates from 2010 through 2016 for all counties in the 50 states and the District of Columbia. Let's read this file as a data frame using Pandas.

`<ReadCensusData.py>`

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# loading the county-level census data
ctyData = pd.read_csv('co-est2016-alldata.csv',
                      encoding = 'iso-8859-1')
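
After loading a file, it helps to check what actually arrived. The sketch below does so on a made-up miniature CSV that mimics a few columns of the census file; the state and county names and the numbers are placeholders, not real data, and `toyData` is a name used only here.

```python
import io
import pandas as pd

# Made-up miniature CSV (placeholder names and numbers, not census data)
toyCsv = io.StringIO(
    "STNAME,CTYNAME,COUNTY,CENSUS2010POP\n"
    "StateA,StateA,0,300\n"
    "StateA,County1,1,100\n"
    "StateA,County2,3,200\n"
    "StateB,StateB,0,500\n"
    "StateB,County1,1,500\n"
)
toyData = pd.read_csv(toyCsv)

print(toyData.shape)             # (number of rows, number of columns)
print(toyData.columns.tolist())  # column names
print(toyData.head(3))           # first three rows
```

The same three calls can be applied to `ctyData` to get a feel for the real file.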

In each state, the row with **`COUNTY==0`** holds the state total (i.e., the total of all counties in that state). So we create another data frame with the state totals only.

In [None]:
# Focusing on state totals (COUNTY==0)
stateData = ctyData[ctyData.COUNTY==0].copy()

At this point, we sort the states according to their population, in descending order.

In [None]:
# Sorting states in the order of population
stateData.sort_values(by='CENSUS2010POP',
                      ascending=False, inplace=True)
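
The two steps above, selecting rows with a boolean mask and then sorting, can be traced on a toy data frame. The values below are placeholders, not census figures, and the `toy...` names exist only in this sketch.

```python
import pandas as pd

# Toy data frame with placeholder values
toyData = pd.DataFrame({
    "STNAME": ["StateA", "StateA", "StateB", "StateB"],
    "COUNTY": [0, 1, 0, 1],
    "CENSUS2010POP": [300, 300, 500, 500],
})

# The comparison produces one True/False value per row
isStateTotal = toyData["COUNTY"] == 0
print(isStateTotal.tolist())

# Keep only the True rows; .copy() makes an independent data frame
toyStates = toyData[isStateTotal].copy()

# Without inplace=True, sort_values returns a new sorted data frame
toySorted = toyStates.sort_values(by="CENSUS2010POP", ascending=False)
print(toySorted)
```

Because `toyStates` is a copy, later changes to it leave `toyData` untouched.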

Finally, we print out the states and their populations.

In [None]:
# printing out total population by state
stateNames = np.array(stateData.STNAME)
statePop = np.array(stateData.CENSUS2010POP)
print('State\t\t\tPopulation')
for i,iState in enumerate(stateNames):
    print('%-20s' % iState + '\t%10d' % statePop[i])
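
The format codes in that loop control the column layout: `%-20s` left-aligns a string in 20 characters, and `%10d` right-aligns an integer in 10. As an alternative sketch, the same layout can be written with f-strings and `zip`; the names and numbers below are placeholders.

```python
# Placeholder values, not real census figures
names = ["StateA", "StateB"]
pops = [300, 12345]

lines = []
for name, pop in zip(names, pops):
    # '<20' left-aligns in 20 characters; '>10d' right-aligns an integer in 10
    lines.append(f"{name:<20}\t{pop:>10d}")

print("\n".join(lines))
```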


## More on NumPy arrays

The file **`LandAreaByState.npz`** is a NumPy data file containing two arrays: **`State`** (name of states) and **`LandArea`** (land area of states). Let's start out by reading in the file. 

In [None]:
# loading the data on state land area
npzData = np.load('LandAreaByState.npz')
State = npzData['State']
LandArea = npzData['LandArea']

You can examine the contents of the file with:

In [None]:
npzData.items()
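
To see how an `.npz` file bundles several named arrays, the sketch below saves two small arrays and loads them back. It writes to an in-memory buffer instead of a file on disk, and the array contents are placeholders.

```python
import io
import numpy as np

# Save two named arrays into one .npz archive (held in memory here)
buffer = io.BytesIO()
np.savez(buffer,
         State=np.array(["StateA", "StateB"]),
         LandArea=np.array([10.0, 20.0]))
buffer.seek(0)   # rewind before reading

toyNpz = np.load(buffer)

# .files lists the names of the arrays stored in the archive
print(toyNpz.files)
for name in toyNpz.files:
    print(name, toyNpz[name].dtype, toyNpz[name].shape)
```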

The order of the states in this file is different from that of the data frame we created earlier. So we create another array of the population in the same order as `LandArea`.  

In [None]:
# population data in the same order as the land area data
Population = np.array([])
for iState in State:
    Population = np.append(Population, statePop[stateNames==iState])
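
The expression `statePop[stateNames==iState]` relies on a boolean mask: the comparison marks the matching position, and indexing with the mask picks out the corresponding population. The sketch below shows the mask on placeholder data, along with one alternative way to reorder (a dictionary lookup) that is an option, not the method used in these notes.

```python
import numpy as np

# Placeholder data: populations in one order, target order in another
stateNames = np.array(["StateB", "StateA", "StateC"])
statePop = np.array([500, 300, 100])
State = np.array(["StateA", "StateC", "StateB"])

# The comparison gives one True/False per element
mask = stateNames == "StateA"
print(mask, statePop[mask])

# Alternative: build a name -> population lookup, then follow the order of State
popByName = dict(zip(stateNames, statePop))
Population = np.array([popByName[name] for name in State])
print(Population)
```

Note that the loop in the notes starts from an empty array, which NumPy creates as floating point, so its `Population` holds floats; the lookup version keeps the integer type of `statePop`.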

Then we just plot the population against the land area.

In [None]:
# plotting the state area vs population
plt.plot(LandArea, Population, 'g^')
plt.title('State area vs. population')
plt.xlabel('Area [sq Mi]')
plt.ylabel('Population')
plt.show()