
<p>This lab on the Introduction to R comes from "An Introduction to Statistical Learning with Applications in R" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. It was re-implemented in Python in Fall 2017 by R. Jordan Crouser at Smith College.</p>



<h1 id="Loading-Data">Loading Data<a class="anchor-link" href="#Loading-Data">¶</a></h1>



<p>For most analyses, the first step involves importing a data set into <code>python</code>. For this class, a lot of the data comes from the <code>ISLR</code> package for <code>R</code>. Unfortunately, this package isn't available for <code>python</code>, so I've exported the data to CSV to make things easier. We can use the <code>read_csv()</code> function from the <code>pandas</code> library to import it.</p>
<p>We begin by loading in the <code>Auto</code> data set.</p>


In [1]:

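%matplotlib inline
import pandas as pd

# Install the ggplot package used in the Plotting section (only needed once)
!pip install ggplot

# Load the Auto data set from the exported CSV file
Auto = pd.read_csv("https://raw.githubusercontent.com/serivan/mldmlab/master/Datasets/Auto.csv")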






<p>When the data loads successfully, <code>read_csv()</code> prints nothing, but the data is now available in your environment.</p>
<p>To view the data, we can either print the entire dataset by typing its name, or we can just look at the first few rows with the <code>head()</code> function.</p>


In [2]:

Auto.head()



Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino
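<p>As a small offline sketch of the same loading step, the snippet below reads CSV text from memory instead of from a URL. The three rows are copied from the output above; the names <code>csv_text</code> and <code>small_auto</code> and the use of <code>io.StringIO</code> are assumptions made for illustration and are not part of the original lab.</p>

```python
import io
import pandas as pd

# Three rows copied from the Auto.head() output above (a subset of the columns)
csv_text = """mpg,cylinders,horsepower,name
18.0,8,130,chevrolet chevelle malibu
15.0,8,165,buick skylark 320
18.0,8,150,plymouth satellite
"""

# read_csv accepts a file-like object as well as a path or URL
small_auto = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows (the default is n=5)
first_two = small_auto.head(2)
print(first_two)
```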



<p>Now that we have the data, we can begin to learn things about it. For example, if we want to know how many rows and columns the DataFrame contains:</p>


In [3]:

Auto.shape



(392, 9)


<p>This tells us that the data has 392 observations, or rows, and nine variables, or columns.</p>
<p>The <code>.dtypes</code> attribute tells us that most of the variables are stored as floating-point numbers or integers, although the <code>name</code> variable has the <code>object</code> dtype, which <code>pandas</code> uses for strings.</p>


In [4]:

Auto.dtypes



mpg             float64
cylinders         int64
displacement    float64
horsepower        int64
weight            int64
acceleration    float64
year              int64
origin            int64
name             object
dtype: object
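<p>The two attributes above can also be used programmatically. The following is a minimal sketch on a tiny hand-made frame (the frame <code>df</code> and its three rows are assumptions for illustration): it unpacks <code>shape</code> and uses <code>select_dtypes()</code> to separate numeric columns from the rest.</p>

```python
import pandas as pd

# Tiny hypothetical frame mirroring three of the Auto columns
df = pd.DataFrame({
    "mpg": [18.0, 15.0, 18.0],
    "cylinders": [8, 8, 8],
    "name": ["chevrolet chevelle malibu", "buick skylark 320", "plymouth satellite"],
})

# shape is a (rows, columns) tuple, so it can be unpacked directly
n_rows, n_cols = df.shape

# select_dtypes filters columns by type; "number" matches ints and floats
numeric_cols = df.select_dtypes(include="number").columns.tolist()
other_cols = df.select_dtypes(exclude="number").columns.tolist()

print(n_rows, n_cols)
print(numeric_cols, other_cols)
```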


<h1 id="Summary-statistics">Summary statistics<a class="anchor-link" href="#Summary-statistics">¶</a></h1>



<p>Often, we want to know some basic things about variables in our data. Calling the <code>describe()</code> method on a DataFrame will give you an idea of some of the distributions of your variables.</p>



<p>More precisely, <code>describe()</code> produces a numerical summary (count, mean, standard deviation, minimum, quartiles and maximum) of each quantitative variable in a particular data set.</p>


In [5]:

Auto.describe()



Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin
count,392.0,392.0,392.0,392.0,392.0,392.0,392.0,392.0
mean,23.445918,5.471939,194.41199,104.469388,2977.584184,15.541327,75.979592,1.576531
std,7.805007,1.705783,104.644004,38.49116,849.40256,2.758864,3.683737,0.805518
min,9.0,3.0,68.0,46.0,1613.0,8.0,70.0,1.0
25%,17.0,4.0,105.0,75.0,2225.25,13.775,73.0,1.0
50%,22.75,4.0,151.0,93.5,2803.5,15.5,76.0,1.0
75%,29.0,8.0,275.75,126.0,3614.75,17.025,79.0,2.0
max,46.6,8.0,455.0,230.0,5140.0,24.8,82.0,3.0
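<p>To make the rows of this summary concrete, here is a hedged sketch that compares <code>describe()</code> on a single column with the corresponding individual methods. The five values are the <code>mpg</code> entries from the <code>head()</code> output earlier; the <code>percentiles</code> argument is shown only as one optional customization.</p>

```python
import pandas as pd

# mpg values copied from the Auto.head() output
mpg = pd.Series([18.0, 15.0, 18.0, 16.0, 17.0], name="mpg")

summary = mpg.describe()

# Each entry of the summary corresponds to a Series method
print(summary["mean"], mpg.mean())
print(summary["50%"], mpg.median())
print(summary["std"], mpg.std())  # sample standard deviation (divides by n - 1)

# Other percentiles can be requested instead of the default quartiles
tails = mpg.describe(percentiles=[0.1, 0.9])
print(tails)
```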



<p>The summary suggests that <code>origin</code> might be better thought of as a categorical variable (a factor, in <code>R</code> terms). It only seems to have three possible values: <code>1</code>, <code>2</code> and <code>3</code>. If we read the documentation about the data, we learn that these numbers correspond to where the car is from: 1 = American, 2 = European, 3 = Japanese. So let's cast that variable into a categorical variable using the <code>astype()</code> method.</p>


In [6]:

Auto["origin"] = Auto["origin"].astype('category')
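<p>As an optional follow-up, the numeric codes can be replaced with readable labels. This is a sketch on a hypothetical five-value sample rather than the lab's own code; the label mapping comes from the description of <code>origin</code> above, and <code>rename_categories()</code> is one of several ways to apply it.</p>

```python
import pandas as pd

# Hypothetical sample of origin codes
origin = pd.Series([1, 1, 3, 2, 1], name="origin")

# Convert the integer codes to a categorical variable
origin_cat = origin.astype("category")

# Replace the numeric codes with descriptive labels
labels = {1: "American", 2: "European", 3: "Japanese"}
origin_named = origin_cat.cat.rename_categories(labels)

# value_counts tabulates how often each category occurs
counts = origin_named.value_counts()
print(counts)
```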




<p>If we want to include a summary of this variable when we call <code>.describe()</code>, we need to let <code>python</code> know we want ALL the variables (not just the quantitative ones):</p>


In [7]:

Auto.describe(include='all')



Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
count,392.0,392.0,392.0,392.0,392.0,392.0,392.0,392.0,392
unique,,,,,,,,3.0,301
top,,,,,,,,1.0,amc matador
freq,,,,,,,,245.0,5
mean,23.445918,5.471939,194.41199,104.469388,2977.584184,15.541327,75.979592,,
std,7.805007,1.705783,104.644004,38.49116,849.40256,2.758864,3.683737,,
min,9.0,3.0,68.0,46.0,1613.0,8.0,70.0,,
25%,17.0,4.0,105.0,75.0,2225.25,13.775,73.0,,
50%,22.75,4.0,151.0,93.5,2803.5,15.5,76.0,,
75%,29.0,8.0,275.75,126.0,3614.75,17.025,79.0,,
max,46.6,8.0,455.0,230.0,5140.0,24.8,82.0,,



<p>Alternatively, we can compute a single statistic with functions from the <code>numpy</code> library, such as <code>mean()</code>, <code>std()</code> and <code>median()</code>:</p>


In [8]:

import numpy as np
np.mean(Auto['displacement'])



194.41198979591837
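<p>One detail worth checking when mixing the two libraries: <code>numpy</code> and <code>pandas</code> use different defaults for the standard deviation. The sketch below uses the five <code>displacement</code> values from the <code>head()</code> output; the variable names are assumptions for illustration.</p>

```python
import numpy as np
import pandas as pd

# displacement values copied from the Auto.head() output
displacement = pd.Series([307.0, 350.0, 318.0, 304.0, 302.0])
values = displacement.to_numpy()

print(np.mean(values), np.median(values))

# NumPy's std divides by n by default (ddof=0) ...
pop_std = np.std(values)
# ... while pandas divides by n - 1 (ddof=1), as in the describe() output
sample_std = displacement.std()

# Passing ddof=1 to NumPy reproduces the pandas value
print(pop_std, sample_std, np.std(values, ddof=1))
```

<p>With 392 rows the two versions differ only slightly, but on small samples the gap is noticeable.</p>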


<h1 id="Plotting">Plotting<a class="anchor-link" href="#Plotting">¶</a></h1>



<p>As in <code>R</code>, we can use the <code>ggplot</code> package to produce simple graphics. <code>ggplot</code> has a particular syntax, which looks like this:</p>


In [9]:

from ggplot import *

ggplot(Auto, aes(x='cylinders', y='mpg')) + \
    geom_point()



AttributeError: module 'pandas' has no attribute 'tslib'


<p>The basic idea is that you initialize a plot with <code>ggplot()</code> and then add "geoms" (short for geometric objects) to it. The <code>ggplot</code> package is based on the <a href="https://www.amazon.com/Grammar-Graphics-Statistics-Computing/dp/0387245448">Grammar of Graphics</a>, a famous book on data visualization theory. It provides a way to map attributes of your data (like variables) to "aesthetics" on the plot. The function name <code>aes()</code> is short for aesthetic.</p>
<p>For more about the <code>ggplot</code> syntax, view the documentation using the <code>help()</code> function. There are also great online resources for <code>ggplot</code>, like <a href="http://ggplot.yhathq.com/">ggplot from ŷhat</a>.</p>


In [None]:

help(ggplot)
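<p>The <code>AttributeError</code> shown earlier in this section suggests that the <code>ggplot</code> package may not work with some <code>pandas</code> versions. If that happens, a similar scatter plot can be sketched with <code>matplotlib</code>. Note that this is a substitute library, not the one the lab is written for, and the five data points are hypothetical.</p>

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample with the same two columns used in the ggplot call
sample = pd.DataFrame({
    "cylinders": [8, 8, 6, 4, 4],
    "mpg": [18.0, 15.0, 21.0, 27.0, 31.0],
})

fig, ax = plt.subplots()
# Map cylinders to the x position and mpg to the y position, one point per row
ax.scatter(sample["cylinders"], sample["mpg"])
ax.set_xlabel("cylinders")
ax.set_ylabel("mpg")
```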




<p>The <code>cylinders</code> variable is stored as integers, so <code>python</code> has treated it
as quantitative. However, since there are only a small number of possible
values for <code>cylinders</code>, one may prefer to treat it as a qualitative variable.
We can turn it into a categorical variable, again using an <code>astype()</code> call.</p>


In [None]:

Auto["cylinders"] = Auto["cylinders"].astype('category')




<p>To view the relationship between a categorical and a numeric variable, we might want to produce <em>boxplots</em>. As usual, a number of options can be specified in order to customize the plots.</p>


In [None]:

ggplot(Auto, aes(x='cylinders', y='mpg')) + \
    geom_boxplot() + \
    xlab("Cylinders") + \
    ylab("MPG")
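<p>A boxplot draws the minimum, quartiles and maximum of each group, so the same information can be tabulated numerically. The following is a minimal sketch on a hypothetical six-row sample using <code>groupby()</code>; the column is kept as plain integers here for simplicity.</p>

```python
import pandas as pd

# Hypothetical sample; cylinders is kept as plain integers here
sample = pd.DataFrame({
    "cylinders": [4, 4, 4, 8, 8, 8],
    "mpg": [27.0, 31.0, 29.0, 18.0, 15.0, 16.0],
})

# One row of summary statistics per cylinder count
by_cyl = sample.groupby("cylinders")["mpg"].describe()
print(by_cyl[["min", "25%", "50%", "75%", "max"]])
```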




<p>The geom <code>geom_histogram()</code> can be used to plot a histogram.</p>


In [None]:

ggplot(Auto, aes(x='mpg')) + \
    geom_histogram()




<p>The function warns us that it used a default number of bins, so we should think more carefully about what value makes sense.</p>


In [None]:

ggplot(Auto, aes(x='mpg')) + \
    geom_histogram(binwidth=5) 
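<p>To see what a bin width of 5 means in terms of counts, the sketch below builds the bin edges by hand and tallies the values with <code>numpy.histogram()</code>. The data values are hypothetical, and the way the edges are aligned to multiples of 5 is an assumption; <code>ggplot</code> may place its bins differently.</p>

```python
import numpy as np

# Hypothetical mpg values
mpg = np.array([18.0, 15.0, 18.0, 16.0, 17.0, 9.0, 24.0, 31.5])

binwidth = 5
# Bin edges at multiples of the bin width that cover the whole data range
start = np.floor(mpg.min() / binwidth) * binwidth
stop = np.ceil(mpg.max() / binwidth) * binwidth
edges = np.arange(start, stop + binwidth, binwidth)

# Count how many values fall into each bin
counts, _ = np.histogram(mpg, bins=edges)
print(edges)
print(counts)
```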




<p>For small datasets, we might want to see all the bivariate relationships between the variables. The <code>pandas</code> library provides a <code>scatter_matrix()</code> function (in <code>pandas.plotting</code>) that does just that. Be patient: it can take a long time!</p>


In [None]:

pd.plotting.scatter_matrix(Auto, alpha=0.2, figsize=(10, 10))
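<p>A quick numerical companion to the scatter matrix is the table of pairwise correlations. This sketch uses the first five rows of three columns from the <code>head()</code> output; with so few rows the numbers are purely illustrative.</p>

```python
import pandas as pd

# First five rows of three Auto columns, copied from the head() output
sample = pd.DataFrame({
    "mpg": [18.0, 15.0, 18.0, 16.0, 17.0],
    "horsepower": [130, 165, 150, 150, 140],
    "weight": [3504, 3693, 3436, 3433, 3449],
})

# Pairwise Pearson correlations between all numeric columns
corr = sample.corr()
print(corr.round(2))
```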




<p>Sometimes, we might want to save a plot for use outside of our Jupyter notebook. To do this, we call the plot's <code>save()</code> method.</p>


In [None]:

p = ggplot(Auto, aes(x='mpg')) + \
    geom_histogram(binwidth=5)

p.save(filename = "histogram.png")
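<p>If the <code>ggplot</code> package is not usable in your environment, a figure can be saved in a similar way with <code>matplotlib</code>. This is a sketch with a substitute library, not the lab's own method; the bin edges and the temporary output directory are assumptions for illustration.</p>

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

# mpg values copied from the Auto.head() output
mpg = [18.0, 15.0, 18.0, 16.0, 17.0]

fig, ax = plt.subplots()
ax.hist(mpg, bins=[15, 16, 17, 18, 19])
ax.set_xlabel("mpg")

# Write the figure to a PNG file in a temporary directory (hypothetical location)
out_path = os.path.join(tempfile.mkdtemp(), "histogram.png")
fig.savefig(out_path, dpi=150)
print(out_path)
```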

