# Importing Data in Python

Note that I have written these notes in a Jupyter Notebook. Jupyter Notebook is included as part of Anaconda. You can find information and examples at https://jupyter-notebook.readthedocs.io/en/stable/. Jupyter Notebooks are a good tool for writing documents that use Python. However, I would not recommend using them for writing serious code. Spyder has much better tools for coding, debugging, etc.

Next, we will learn how to import data in Python. The procedures are quite similar to R. First, we will need to import the **pandas** module. This is one of the primary modules in Python for doing statistics and data analysis. See the link below:


https://pandas.pydata.org/


I import **pandas** below, along with **numpy** and **matplotlib**.


In [1]:
import pandas as pd
import numpy as np
import matplotlib
from matplotlib import pyplot as plt

The **pandas** module has a function **read_csv** that can be used to read in CSV files. We will start by reading in the gun data:

In [2]:
pd.read_csv("../data/full_data.csv")

|  | Unnamed: 0 | year | month | intent | police | sex | age | race | hispanic | place | education |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2012 | 1 | Suicide | 0 | M | 34.0 | Asian/Pacific Islander | 100 | Home | BA+ |
| 1 | 2 | 2012 | 1 | Suicide | 0 | F | 21.0 | White | 100 | Street | Some college |
| 2 | 3 | 2012 | 1 | Suicide | 0 | M | 60.0 | White | 100 | Other specified | BA+ |
| 3 | 4 | 2012 | 2 | Suicide | 0 | M | 64.0 | White | 100 | Home | BA+ |
| 4 | 5 | 2012 | 2 | Suicide | 0 | M | 31.0 | White | 100 | Other specified | HS/GED |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 100793 | 100794 | 2014 | 12 | Homicide | 0 | M | 36.0 | Black | 100 | Home | HS/GED |
| 100794 | 100795 | 2014 | 12 | Homicide | 0 | M | 19.0 | Black | 100 | Street | HS/GED |
| 100795 | 100796 | 2014 | 12 | Homicide | 0 | M | 20.0 | Black | 100 | Street | HS/GED |
| 100796 | 100797 | 2014 | 12 | Homicide | 0 | M | 22.0 | Hispanic | 260 | Street | Less than HS |


This is very similar to R. See the help page for **read_csv** for details:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv


The options are very similar to those of **read_csv** and **read.csv** in R, although somewhat less convenient. pandas has fewer dedicated functions for tab-delimited or space-delimited files than R does (**read_table** uses a tab as its default delimiter). However, you can read in any of these file types with **read_csv** by using the **sep** option.
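Here is a minimal sketch of the **sep** option. The two text snippets are made up for illustration and are read from memory with `io.StringIO`, which stands in for a file on disk; with a real file you would pass the file path instead.

```python
import io

import pandas as pd

# Made-up tab-delimited text standing in for a file on disk.
tab_text = "year\tintent\tage\n2012\tSuicide\t34\n2012\tHomicide\t21\n"

# sep="\t" tells read_csv that the fields are separated by tabs.
tab_df = pd.read_csv(io.StringIO(tab_text), sep="\t")

# Made-up space-delimited text; the regular expression r"\s+"
# treats any run of whitespace as a single delimiter.
space_text = "year intent age\n2012 Suicide 34\n2012 Homicide 21\n"
space_df = pd.read_csv(io.StringIO(space_text), sep=r"\s+")

print(tab_df.shape)
print(space_df.columns.tolist())
```

Both calls should produce the same two-row, three-column data frame.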

**pandas.read_csv** reads in the data as a *pandas* data frame. This has properties similar to those of data frames in R (in fact, *pandas* was designed to emulate R).

In the code above, I read in the guns data without actually saving it to a data frame. Now, I will do that: 




In [3]:
guns = pd.read_csv("../data/full_data.csv")

We can find the dimension of the data as follows: 

In [4]:
guns.shape

(100798, 11)

We can access columns of the data frame with the syntax **dataframename.colname**. For example, we can access the **intent** column as shown below. 

In [5]:
guns.intent

0          Suicide
1          Suicide
2          Suicide
3          Suicide
4          Suicide
            ...   
100793    Homicide
100794    Homicide
100795    Homicide
100796    Homicide
100797    Homicide
Name: intent, Length: 100798, dtype: object

We can access parts of a column in a manner similar to R: 

In [6]:
guns.intent[0:9]

0         Suicide
1         Suicide
2         Suicide
3         Suicide
4         Suicide
5         Suicide
6    Undetermined
7         Suicide
8      Accidental
Name: intent, dtype: object

Note that Python indexes from 0, whereas R indexes from 1! A slice such as `0:9` also excludes its end point, so it returns the nine elements in positions 0 through 8.
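The indexing rules can be checked on a small example. This is only a sketch: the Series `s` below is made up for illustration and is not part of the guns data.

```python
import pandas as pd

# A small made-up Series with the default integer index 0, 1, 2, ...
s = pd.Series(["a", "b", "c", "d", "e"])

# With the default index, the first element has label 0, not 1.
first = s[0]

# The slice 0:3 covers positions 0, 1 and 2; position 3 is excluded.
first_three = s[0:3]

print(first)
print(first_three.tolist())
```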

We can access particular elements as follows: 

In [7]:
guns.intent[[2,3,10]]

2     Suicide
3     Suicide
10    Suicide
Name: intent, dtype: object

We can also access a column by putting its name in square brackets:

In [8]:
guns['intent'][0:10]


0         Suicide
1         Suicide
2         Suicide
3         Suicide
4         Suicide
5         Suicide
6    Undetermined
7         Suicide
8      Accidental
9         Suicide
Name: intent, dtype: object
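The dot syntax and the bracket syntax return the same column, but only the bracket syntax works when a column name is not a valid Python name (for example, when it contains a space). A minimal sketch, using a made-up data frame whose column `age group` is hypothetical:

```python
import pandas as pd

# Made-up data frame; "age group" is a hypothetical column name with a space.
df = pd.DataFrame(
    {"intent": ["Suicide", "Homicide"], "age group": ["30-39", "20-29"]}
)

# Both syntaxes select the same column.
same = df.intent.equals(df["intent"])

# A name containing a space can only be used with the bracket syntax.
age_group = df["age group"]

print(same)
print(age_group.tolist())
```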

We will talk much more about accessing data frames as we progress through the semester.

### Working Directory and File Paths
As with R, we must specify a working directory. In Spyder, you do this by typing in or browsing to the desired directory in the box in the upper right corner of the interface. In Jupyter Notebooks, the working directory is the same directory as the notebook by default. If you need to change it, you can easily google the correct procedure.
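One way to check or change the working directory from code is the standard library's **os** module. This is a sketch of one approach, not the only one; moving to the parent directory (`".."`) is just an arbitrary example target.

```python
import os

# Show the current working directory.
start_dir = os.getcwd()
print(start_dir)

# Change the working directory; ".." (the parent directory) is an arbitrary example.
os.chdir("..")
print(os.getcwd())

# Change back so that later relative paths behave as before.
os.chdir(start_dir)
```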

When reading in data, you must specify a valid file path from the working directory to the file that contains the data. For example, in my code above, the file path is

```"../data/full_data.csv"```

This says  
1. Start in the working directory.  
2. Go up one directory level (`../`).  
3. Within that directory, look for the folder `data`.  
4. Within the folder `data`, look for the file called `full_data.csv`.

This is known as a **relative** file path because it gives the path relative to the working directory. There are also **absolute** file paths that give the path starting from the root directory. The absolute path for the guns data on my computer is

```"C:/Users/pdschlie/Dropbox/courses/STAT4365/Data Sets/guns-data-master/full_data.csv"```

Because I use cloud storage (Dropbox) and use my files on multiple computers, I must use relative file paths. The absolute file path is not the same on my different computers (because Dropbox is in different locations) but the relative file path is always the same because everything (code and data) is contained within Dropbox. 
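If you are unsure where a relative path points, you can ask Python to convert it to an absolute path. Here is a minimal sketch using the standard library's **pathlib** module; the result depends on your own working directory, and the file does not need to exist for the conversion to work.

```python
from pathlib import Path

# The relative path used in the code above.
rel_path = Path("../data/full_data.csv")

# Join it to the working directory and simplify the ".." step.
abs_path = (Path.cwd() / rel_path).resolve()

print(rel_path.is_absolute())  # False
print(abs_path.is_absolute())  # True
print(abs_path)

# exists() reports whether a file is actually found at that location.
print(abs_path.exists())
```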



Next, let's do an exercise. 

**Exercise:**   
1) Read in the College Scorecard data, accounting for missing data in the form of NULL and Privacy_Suppressed.  
2) Inspect the admission rate column of your data frame to see whether missing values were handled correctly.   
3) Read in both_sexes_for Exercise.csv, accounting for the extra text at the top. This data set is in the Marriage Data folder on eLC.  
4) Inspect the data frame to ensure that it was read in correctly.   


For solutions, see the file ExerciseSolutions_ImportingData.py on eLC. 
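As a hint, here is a sketch of the two **read_csv** options that the exercise relies on, **na_values** and **skiprows**, applied to made-up text rather than the actual course files. The column names and the number of note lines are assumptions for illustration only.

```python
import io

import pandas as pd

# Made-up text: two lines of notes, then a header row and three data rows.
raw = (
    "Some descriptive note\n"
    "Another note line\n"
    "school,adm_rate\n"
    "A,0.45\n"
    "B,NULL\n"
    "C,Privacy_Suppressed\n"
)

toy = pd.read_csv(
    io.StringIO(raw),
    skiprows=2,                                # skip the two note lines at the top
    na_values=["NULL", "Privacy_Suppressed"],  # treat these strings as missing
)

# With the missing-value codes recognized, the column is numeric.
print(toy.adm_rate.dtype)
print(toy.adm_rate.isna().sum())
```

If the missing-value codes are not recognized, the column is read in as text instead of numbers, which is one quick way to spot a problem.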


We will talk extensively about making plots, but here is a quick example:

In [9]:
guns.age.plot(kind='hist')

ImportError: matplotlib is required for plotting when the default backend "matplotlib" is selected.

**Exercise:**  

Make a histogram of admission rates in the College Scorecard data.