# Python for R users - Learn by doing

## Part 1: Pandas & Basic Syntax
### Basics and importing Pandas
  

Welcome! This is my attempt to share a little of what I know about Python with those who (like me) primarily use R for data wrangling, reporting, and analysis. Python is a general-purpose language used in many fields, from web development to deep learning. Because it is so versatile, the language doesn't come 'data analysis ready' right out of the box the way R does. For example, there is no concept of a data frame in 'base' Python as there is in base R. To gain these capabilities we need to import the pandas library (a library is analogous to a package in R).

We will do that with an import statement, like so:

In [4]:
import pandas

`import pandas` plays nearly the same role as calling `library(pandas)` would in R.

Or, more commonly, we can use the following:

In [2]:
import pandas as pd

The second part of the statement above, `as pd`, nicknames pandas as `pd` to save us some typing :)

The pandas library contains a function, `read_csv`, which reads CSV files into a data frame, just like in R. As in R, we can call a function from a specific library: where R has `readr::read_csv`, Python has `pandas.read_csv` or, because we imported pandas `as pd`, `pd.read_csv`. So we type `pd.read_csv('csvfile.csv')`, which is analogous to `readr::read_csv('csvfile.csv')` in R. An important difference is that in Python these functions are not simply dumped into the global namespace, whereas in R `library(readr)` makes all of readr's functions available globally. R's approach can cause name clashes, but luckily R usually informs you of them (if you've ever loaded dplyr, you've probably seen such a message about masked functions, for example `filter`, or `select` when MASS is also loaded).

In [3]:
mtcars = pd.read_csv('mtcars.csv')
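
To make the namespace point concrete, here is a small sketch. It assumes you don't have `mtcars.csv` handy, so it reads a few rows (copied from the output further down) out of a string instead. The names `csv_text`, `cars_a`, and `cars_b` are just made up for the illustration.

```python
import io
import pandas as pd
from pandas import read_csv  # brings in one name only, the closest thing to library() attaching it

# Hypothetical inline stand-in for mtcars.csv (a few rows and columns only).
csv_text = """model,mpg,cyl,hp
Mazda RX4,21.0,6,110
Mazda RX4 Wag,21.0,6,110
Datsun 710,22.8,4,93
"""

# Namespaced call: the function is reached through the module alias.
cars_a = pd.read_csv(io.StringIO(csv_text))

# Bare call: works only because read_csv was imported by name above.
cars_b = read_csv(io.StringIO(csv_text))

print(cars_a.equals(cars_b))     # both calls produce the same data frame
print(pd.read_csv is read_csv)   # and both names point at the same function
```

Most pandas code you will see sticks with the `pd.` prefix, since it keeps it obvious where each function comes from.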


The first argument of the `read_csv` function is the path to the data. There are many more arguments, but none of them are required. To get help for a function in IPython or Jupyter, type `?function`. In this case we would type `?pd.read_csv` to read all about the function, its arguments, etc. This is nearly identical to typing `?package::functionname` in R.

In [8]:
?pd.read_csv
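
The `?` shortcut is an IPython/Jupyter convenience. As a rough sketch of what works in any Python session, the built-in `help()` and the standard `inspect` module get at the same documentation. The variable names here are only for illustration.

```python
import inspect
import pandas as pd

# help(pd.read_csv) prints the full documentation in any Python session.
# It is commented out here only because the output is very long.
# help(pd.read_csv)

# The same text is stored on the function as its docstring.
doc = inspect.getdoc(pd.read_csv)
print(doc.splitlines()[0])

# inspect.signature lists the parameters; the first one is the path (or buffer).
params = list(inspect.signature(pd.read_csv).parameters)
print(params[0])
print(len(params), "parameters in total")
```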

Great! We have our data, so now what? Well, let's explore. First, we can look at the first few rows using the `head` method.

In [9]:
mtcars.head()

Unnamed: 0,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


And the last few rows with the `tail` method.

In [6]:
mtcars.tail()

Unnamed: 0,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
27,Lotus Europa,30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2
28,Ford Pantera L,15.8,8,351.0,264,4.22,3.17,14.5,0,1,5,4
29,Ferrari Dino,19.7,6,145.0,175,3.62,2.77,15.5,0,1,5,6
30,Maserati Bora,15.0,8,301.0,335,3.54,3.57,14.6,0,1,5,8
31,Volvo 142E,21.4,4,121.0,109,4.11,2.78,18.6,1,1,4,2


Or what if we want some summary statistics?

In [5]:
mtcars.describe()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
count,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0
mean,20.090625,6.1875,230.721875,146.6875,3.596563,3.21725,17.84875,0.4375,0.40625,3.6875,2.8125
std,6.026948,1.785922,123.938694,68.562868,0.534679,0.978457,1.786943,0.504016,0.498991,0.737804,1.6152
min,10.4,4.0,71.1,52.0,2.76,1.513,14.5,0.0,0.0,3.0,1.0
25%,15.425,4.0,120.825,96.5,3.08,2.58125,16.8925,0.0,0.0,3.0,2.0
50%,19.2,6.0,196.3,123.0,3.695,3.325,17.71,0.0,0.0,4.0,2.0
75%,22.8,8.0,326.0,180.0,3.92,3.61,18.9,1.0,1.0,4.0,4.0
max,33.9,8.0,472.0,335.0,4.93,5.424,22.9,1.0,1.0,5.0,8.0


Awesome!

You may have noticed that in R we would have used functions like `head(mtcars)`, `tail(mtcars)`, or `summary(mtcars)`. In Python, it is common to see the function call follow the object and a dot. This is using an object's method. In these cases, our object is our pandas data frame, and our method is the function (`head`, `tail`, `describe`, etc.). We can think of a method as a function that “belongs to” an object. We couldn't just call `head(mtcars)`, because Python has no global function named `head`.
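
Here is a minimal sketch of the method idea, assuming a tiny stand-in data frame (`cars`, built from a few values in the output above) rather than the full mtcars file.

```python
import pandas as pd

# Small stand-in for mtcars, using values from the head() output.
cars = pd.DataFrame({
    "model": ["Mazda RX4", "Mazda RX4 Wag", "Datsun 710"],
    "mpg": [21.0, 21.0, 22.8],
})

# Method call: the function is looked up on the object itself.
top = cars.head(2)

# Equivalent spelling that makes the "belongs to" relationship explicit:
# the method lives on the DataFrame class and takes the object as its first argument.
same_top = pd.DataFrame.head(cars, 2)
print(top.equals(same_top))

# An R-style head(cars) fails, because nothing named head exists in the global namespace.
try:
    head(cars)
    bare_call_worked = True
except NameError:
    bare_call_worked = False
print(bare_call_worked)
```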

An important part of any work with data frames is being able to look at certain columns or rows.
  
Luckily, pandas' data frames have some methods for us to use!

First, we can pass a list of names in square brackets to our data frame to select columns by name, like so:

In [15]:
mtcars[['mpg', 'hp']].head()

Unnamed: 0,mpg,hp
0,21.0,110
1,21.0,110
2,22.8,93
3,21.4,110
4,18.7,175
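
One thing that tripped me up coming from R: single and double brackets behave the opposite way round. A quick sketch, again assuming a small made-up `cars` data frame with values taken from the output above:

```python
import pandas as pd

cars = pd.DataFrame({
    "model": ["Mazda RX4", "Mazda RX4 Wag", "Datsun 710"],
    "mpg": [21.0, 21.0, 22.8],
    "hp": [110, 110, 93],
})

# A single string returns one column as a Series (like mtcars$mpg or mtcars[["mpg"]] in R).
one_col = cars["mpg"]

# A list of names returns a DataFrame, even with one name (like mtcars["mpg"] in R).
sub_df = cars[["mpg"]]
two_cols = cars[["mpg", "hp"]]

print(type(one_col).__name__)   # Series
print(type(sub_df).__name__)    # DataFrame
print(two_cols.shape)           # (3, 2)
```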


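There are also two special indexers that help us select data: `.loc` for label-based indexing and `.iloc` for positional indexing. Both take rows first and columns second, as in R's `df[rows, cols]`, but positions start at 0 rather than 1.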

In [8]:
mtcars.loc[:5,['model',"cyl",'hp']]

Unnamed: 0,model,cyl,hp
0,Mazda RX4,6,110
1,Mazda RX4 Wag,6,110
2,Datsun 710,4,93
3,Hornet 4 Drive,6,110
4,Hornet Sportabout,8,175
5,Valiant,6,105


In [19]:
mtcars.iloc[[0,1],[0,1]]

Unnamed: 0,model,mpg
0,Mazda RX4,21.0
1,Mazda RX4 Wag,21.0
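
The difference between the two is easiest to see side by side. This is only a sketch on a small made-up `cars` data frame (values from the output above), and the `named` data frame with models as the index is my own addition for illustration.

```python
import pandas as pd

cars = pd.DataFrame({
    "model": ["Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive"],
    "mpg": [21.0, 21.0, 22.8, 21.4],
    "hp": [110, 110, 93, 110],
})

# .loc slices by label and includes the end label: rows 0, 1 and 2.
by_label = cars.loc[:2, ["model", "hp"]]

# .iloc slices by position and excludes the end, like Python lists: rows 0 and 1.
by_position = cars.iloc[:2, [0, 2]]

print(len(by_label), len(by_position))   # 3 2

# With a non-numeric index, labels and positions are clearly different things.
named = cars.set_index("model")
print(named.loc["Datsun 710", "mpg"])    # look the cell up by label
print(named.iloc[2, 0])                  # the same cell by position
```

That inclusive end is why `mtcars.loc[:5, ...]` above returned six rows (0 through 5).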


In [26]:
mtcars["mpg"].mean()

20.090624999999996

In [18]:
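mtcars.mean(numeric_only=True)  # skip the text column 'model'; recent pandas versions raise an error without this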

mpg      20.090625
cyl       6.187500
disp    230.721875
hp      146.687500
drat      3.596563
wt        3.217250
qsec     17.848750
vs        0.437500
am        0.406250
gear      3.687500
carb      2.812500
dtype: float64
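
Column means only make sense for numeric columns, so here is a short sketch of two ways to leave the text column out. It assumes a small stand-in `cars` data frame; `select_dtypes` is roughly the pandas counterpart of `mtcars[sapply(mtcars, is.numeric)]` in R.

```python
import pandas as pd

cars = pd.DataFrame({
    "model": ["Mazda RX4", "Mazda RX4 Wag", "Datsun 710"],
    "mpg": [21.0, 21.0, 22.8],
    "hp": [110, 110, 93],
})

# Option 1: ask mean() to ignore non-numeric columns.
col_means = cars.mean(numeric_only=True)

# Option 2: keep only the numeric columns first, then summarise.
numeric_cars = cars.select_dtypes(include="number")
same_means = numeric_cars.mean()

print(col_means)
print(list(numeric_cars.columns))   # ['mpg', 'hp']
```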

In [21]:
mtcars["mpg"] * 0.425144 # to km/l

0      8.928024
1      8.928024
2      9.693283
3      9.098082
4      7.950193
5      7.695106
6      6.079559
7     10.373514
8      9.693283
9      8.162765
10     7.567563
11     6.972362
12     7.354991
13     6.462189
14     4.421498
15     4.421498
16     6.249617
17    13.774666
18    12.924378
19    14.412382
20     9.140596
21     6.589732
22     6.462189
23     5.654415
24     8.162765
25    11.606431
26    11.053744
27    12.924378
28     6.717275
29     8.375337
30     6.377160
31     9.098082
Name: mpg, dtype: float64
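
The multiplication above is vectorized, just like arithmetic on a column in R. As a sketch of the usual next step, storing the result as a new column, here is one way to do it on a small made-up `cars` data frame. The column names `kml` and `kml_rounded` are my own choices.

```python
import pandas as pd

MPG_TO_KML = 0.425144  # same conversion factor as in the cell above

cars = pd.DataFrame({
    "model": ["Mazda RX4", "Mazda RX4 Wag", "Datsun 710"],
    "mpg": [21.0, 21.0, 22.8],
})

# Assigning to a new column name modifies the data frame in place,
# like mtcars$kml <- mtcars$mpg * 0.425144 in R.
cars["kml"] = cars["mpg"] * MPG_TO_KML

# assign() returns a new data frame and leaves the original untouched,
# which feels closer to dplyr::mutate().
cars_rounded = cars.assign(kml_rounded=cars["kml"].round(2))

print(cars_rounded)
```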

In [22]:
mtcars[mtcars['mpg'] > 25]

Unnamed: 0,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
17,Fiat 128,32.4,4,78.7,66,4.08,2.2,19.47,1,1,4,1
18,Honda Civic,30.4,4,75.7,52,4.93,1.615,18.52,1,1,4,2
19,Toyota Corolla,33.9,4,71.1,65,4.22,1.835,19.9,1,1,4,1
25,Fiat X1-9,27.3,4,79.0,66,4.08,1.935,18.9,1,1,4,1
26,Porsche 914-2,26.0,4,120.3,91,4.43,2.14,16.7,0,1,5,2
27,Lotus Europa,30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2
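
What happens in the filter above is that the comparison builds a column of True/False values, and the brackets keep the rows where it is True (much like `mtcars[mtcars$mpg > 25, ]` in R). A small sketch, assuming a stand-in `cars` data frame with a few rows from the outputs above:

```python
import pandas as pd

cars = pd.DataFrame({
    "model": ["Mazda RX4", "Datsun 710", "Hornet Sportabout", "Fiat 128", "Lotus Europa"],
    "mpg": [21.0, 22.8, 18.7, 32.4, 30.4],
    "hp": [110, 93, 175, 66, 113],
})

# The comparison alone produces a boolean Series, one value per row.
mask = cars["mpg"] > 25
print(mask.tolist())   # [False, False, False, True, True]

# Combine conditions with & (and) or | (or); each condition needs its own parentheses.
efficient_and_strong = cars[(cars["mpg"] > 25) & (cars["hp"] > 100)]

# query() takes the condition as a string, which reads a bit like dplyr::filter().
same_rows = cars.query("mpg > 25 and hp > 100")

print(efficient_and_strong["model"].tolist())   # ['Lotus Europa']
```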


One of the most powerful tools in R is the `apply` function and its sister functions. Pandas defines a similar method for data frames, conveniently called `apply`. The first argument we pass is the function, followed by the axis: 0 applies the function to each column, 1 to each row. The data argument is implicitly the data frame that `apply` is called on, so we can think of it almost like a pipe.
Type `?pd.DataFrame.apply` for more!

In [36]:
mtcars[['mpg', 'hp']].apply(sum, axis=1)

0     131.0
1     131.0
2     115.8
3     131.4
4     193.7
5     123.1
6     259.3
7      86.4
8     117.8
9     142.2
10    140.8
11    196.4
12    197.3
13    195.2
14    215.4
15    225.4
16    244.7
17     98.4
18     82.4
19     98.9
20    118.5
21    165.5
22    165.2
23    258.3
24    194.2
25     93.3
26    117.0
27    143.4
28    279.8
29    194.7
30    350.0
31    130.4
dtype: float64
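
To show what the axis argument changes, here is a short sketch on a made-up two-column `cars` data frame. The comparison with the built-in `sum` method is my own addition: for simple reductions like this one, pandas already ships a vectorized method, and `apply` is more useful for custom functions.

```python
import pandas as pd

cars = pd.DataFrame({
    "mpg": [21.0, 21.0, 22.8],
    "hp": [110, 110, 93],
})

# axis=1: the function receives one row at a time, like apply(df, 1, sum) in R.
row_sums = cars.apply(sum, axis=1)

# axis=0 (the default): the function receives one column at a time, like apply(df, 2, sum) in R.
col_sums = cars.apply(sum, axis=0)

# For a plain sum, the built-in method gives the same row totals without apply.
fast_row_sums = cars.sum(axis=1)

print(row_sums)
print(col_sums)
```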

In [39]:
?pd.DataFrame.apply