# Course: Intro to Python & R for Data Analysis
## Lecture: New R Kids on the Block


Professor: Mary Kaltenberg

Fall 2020

contact: mkaltenberg@pace.edu

About me: www.mkaltenberg.com

Python and R have a lot of similarities in the way they operate, but there are some slight differences.

Remember how I said computer languages are a lot like languages? Well, you're about to become bilingual.

Depending on what you're most comfortable with, you may think about how to say something in your primary language and then translate it to your secondary/tertiary language.

Python and R are kind of like Italian and Spanish - they're different, but if you know one really well, learning the other is not super hard. It will take work to master it, of course, but the translation is similar enough that you can figure it out quickly if you can find the right words to use.

A great resource listed in the syllabus is R for Data Science, which is freely available online [here](https://r4ds.had.co.nz/) - I rely on it heavily in this lecture. Another good option is YaRrr! The Pirate's Guide to R, available [here](https://bookdown.org/ndphillips/YaRrr/order-sorting-data.html).

First, let's deal with install issues that will surely happen.

Anaconda had another update (of course). Don't update it unless we have to (it will break stuff, probably). 

To connect R to Jupyter Notebook (if you haven't already), [these](https://irkernel.github.io/installation/#source-panel) instructions are useful. You can also use RStudio - I think it's fantastic and a great interface for R. But, for simplicity and for my own preference, I'm going to use R within Jupyter Notebook.

Let's take a moment to see if you got your R kernels to start - for those that haven't, we can troubleshoot offline (but quickly tell me the error, as I might have fixed it before).

<img src = 'https://media.giphy.com/media/FspLvJQlQACXu/giphy.gif' width = 250>

## CRAN & Mirrors
R packages are stored on a network of servers around the world called the Comprehensive R Archive Network (CRAN). Each server (a "mirror") holds an identical copy of the same information.

So, likely, the first thing you'll need to do is to use a mirror near you - a list of mirrors can be found [here](https://cran.r-project.org/mirrors.html)

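Once you set it, it's done for the current R session. No need to go back to this step unless you want to pull from a different server (for example, one that is closer to you after a move). To make the setting permanent, add the same `options()` line to your `.Rprofile`.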


In [3]:
# you can set your CRAN (in this case I am using Iowa State)
options("repos" = c(CRAN = "https://mirror.las.iastate.edu/CRAN/"))
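To double-check that the mirror was picked up, you can read the option back. This is a minimal sketch using the same Iowa State URL as above; the variable name `current_mirror` is just for illustration.

```r
# Set the CRAN mirror for this session (same URL as in the lecture)
options(repos = c(CRAN = "https://mirror.las.iastate.edu/CRAN/"))

# Read the option back to confirm which mirror R will use
current_mirror <- getOption("repos")[["CRAN"]]
print(current_mirror)
```

If the printed URL is not the one you expected, the option was probably set in a different session or kernel.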

Like Python, R has "packages" - rather than using import, we load packages through `library`, such as:

` library(package_name) `

As usual, you have to call the package that you want in your notebook.

In a moment, we're going to start working with tidyverse - this is the pandas of R.

But, just to see the importing of a package, let's try it here.

In [49]:
library(tidyverse)   

In [50]:
library(dplyr)  

If you want to install a package you can do so with:
`install.packages('packagename')`
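Before installing anything, it can help to check whether the package is already there. The sketch below uses base R's `requireNamespace()`; the package name `"stats"` is only a stand-in that ships with every R installation, so swap in the package you actually care about (for example `"dplyr"`).

```r
# Hypothetical example package; "stats" ships with R, so this check succeeds
pkg <- "stats"

# requireNamespace() returns TRUE/FALSE instead of stopping with an error
is_installed <- requireNamespace(pkg, quietly = TRUE)

if (is_installed) {
  message(pkg, " is installed; load it with library(", pkg, ")")
} else {
  message(pkg, " is missing; install it before calling library()")
}
```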

!!!! HOWEVER !!!! Anaconda has a bunch of issues when you do this. So, to avoid breaking it, go to your terminal/shell to update packages and dependencies.

To update dependencies and packages:
`conda update r-caret`

For this class do this:
`conda metapackage custom-r-bundle 0.1.0 --dependencies r-irkernel jupyter r-ggplot2 r-dplyr`

This bundles the packages we need (the R kernel, Jupyter, ggplot2 and dplyr) into one custom metapackage.


If you see an "image not found" error in your terminal, try running these commands in your **Terminal**:

First:
`conda create -n r_env r-essentials r-base`

Then:
`conda activate r_env`

Then launch jupyter notebook:
`jupyter notebook`

Check to see that you can connect and open the R environment successfully.

Sometimes this also does the trick:
`conda update -c rdonnellyr -c main --all`

## Basic Commands
Before we get to working with data, let's do a quick overview of some basic commands.

To change the working directory we use:

`setwd('/your/working/directory/')`

To find out which working directory you are in, we use:

`getwd()`

In [10]:
getwd()

In [9]:
setwd('/Users/mkaltenberg/Documents/Data Analysis Python R Lectures/Data_Analysis_Python_R/Lecture_8/')
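Since everyone's folders are different, here is a small sketch you can run anywhere. It moves into R's temporary directory (chosen only for illustration) and then moves back, using the fact that `setwd()` invisibly returns the directory you were in before the change.

```r
# Remember where we started
start_dir <- getwd()

# setwd() invisibly returns the previous working directory
previous_dir <- setwd(tempdir())
print(getwd())   # now inside the session's temporary directory

# Go back to where we were
setwd(previous_dir)
print(getwd())
```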

As in Python, you can assign values to variable names, but in R this is usually done with `<-`.
In R documentation they almost always use `<-` (you can thank biologists for this)

Though, `=` will also work.

In [11]:
x <- 'hi there!'
x

y = 'bye!'
y


In Python we are used to printing with print() or by putting the object itself on the last line of a cell. In R, an object's name on a line by itself prints it too, and wrapping an expression in parentheses prints its result - handy for assigning and printing in one step. This works with any object.

In [13]:
(y)

Instead of range, R uses the function seq. You can find an R equivalent for nearly every function that exists in Python.
You can change a variable's value just as easily by reassigning the variable name.

In [14]:
(y <- seq(0, 10, 2))
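One difference worth seeing for yourself: `seq()` includes the end value when the step lands on it, while Python's `range()` stops before it. A short sketch (the variable names are made up for illustration):

```r
# seq() includes the end value: 0 2 4 6 8 10
evens_inclusive <- seq(0, 10, by = 2)

# To match Python's range(0, 10, 2), stop at 8: 0 2 4 6 8
evens_like_range <- seq(0, 8, by = 2)

# seq_len(n) counts from 1 to n: 1 2 3 4 5
first_five <- seq_len(5)

print(evens_inclusive)
print(evens_like_range)
print(first_five)
```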

## Import/Export data

To read a csv file you use

`read.csv(path_to_csv_file)`


To save a csv file you use
`write.csv(df, path_to_csv_file)`

In [3]:
setwd('/Users/mkaltenberg/Documents/Data Analysis Python R Lectures/Data_Analysis_Python_R/')
read.csv('Lecture_8/job-automation-probability.csv')

X_...rank,X_...code,prob,Average.annual.wage,education,occupation,short.occupation,len,probability,numbEmployed,median_ann_wage,employed_may2016,average_ann_wage
624,51-4033,0.9500,34920,High school diploma or equivalent,"Grinding, Lapping, Polishing and Buffing Machine Tool Setters, Operators and Tenders, Metal and Plastic","Tool setters, operators and tenders",35,0.9500,74600,32890,74600,34920
517,51-9012,0.8800,41450,High school diploma or equivalent,"Separating, Filtering, Clarifying, Precipitating and Still Machine Setters, Operators and Tenders","Tool setters, operators and tenders",35,0.8800,47160,38360,47160,41450
484,41-4012,0.8500,68410,High school diploma or equivalent,"Sales Representatives, Wholesale and Manufacturing, Except Technical and Scientific Products","Sales Representatives, Wholesale and Manufacturing",92,0.8500,1404050,57140,1404050,68410
105,53-1031,0.0290,59800,High school diploma or equivalent,First-Line Supervisors of Transportation and Material-Moving Machine and Vehicle Operators,Supervisors Transportation,26,0.0290,202760,57270,202760,59800
620,51-4072,0.9500,32660,High school diploma or equivalent,"Molding, Coremaking and Casting Machine Setters, Operators and Tenders, Metal and Plastic","Molding, Coremaking and Casting Machine Setters, Operators and Tenders, Metal and Plastic",89,0.9500,145560,30480,145560,32660
518,51-6091,0.8800,35420,High school diploma or equivalent,"Extruding and Forming Machine Setters, Operators and Tenders, Synthetic and Glass Fibers","Extruding and Forming Machine Setters, Operators and Tenders, Synthetic and Glass Fibers",88,0.8800,19340,34240,19340,35420
427,51-4031,0.7800,34210,High school diploma or equivalent,"Cutting, Punching and Press Machine Setters, Operators and Tenders, Metal and Plastic","Cutting, Punching and Press Machine Setters, Operators and Tenders, Metal and Plastic",85,0.7800,192800,32370,192800,34210
228,41-4011,0.2500,92910,Bachelor's degree,"Sales Representatives, Wholesale and Manufacturing, Technical and Scientific Products","Sales Representatives, Wholesale and Manufacturing, Technical and Scientific Products",85,0.2500,328370,78980,328370,92910
590,51-4032,0.9400,38880,High school diploma or equivalent,"Drilling and Boring Machine Tool Setters, Operators and Tenders, Metal and Plastic","Drilling and Boring Machine Tool Setters, Operators and Tenders, Metal and Plastic",82,0.9400,12290,36410,12290,38880
584,51-9041,0.9300,34370,High school diploma or equivalent,"Extruding, Forming, Pressing and Compacting Machine Setters, Operators and Tenders","Extruding, Forming, Pressing and Compacting Machine Setters, Operators and Tenders",82,0.9300,71260,32510,71260,34370


In [24]:
# assign the data frame to a variable named jobs
jobs <- read.csv('Lecture_8/job-automation-probability.csv')

In [5]:
#to export the dataset
write.csv(jobs, 'Lecture_8/jobs2.csv')
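By default `write.csv()` also writes the row names as an extra first column, which comes back as a column named `X` when you read the file again. The sketch below uses a tiny made-up data frame and a temporary file (both are assumptions for illustration) to show how `row.names = FALSE` avoids that.

```r
# Small made-up data frame for illustration
scores <- data.frame(name = c("a", "b"), score = c(1.5, 2.5))

# Write to a temporary file without the row-name column
csv_path <- tempfile(fileext = ".csv")
write.csv(scores, csv_path, row.names = FALSE)

# Read it back: the column names match the original
scores_back <- read.csv(csv_path)
print(names(scores_back))
```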

## Help

As in Python, you can always ask for documentation; in R the function is:
`help()`

In [6]:
help(seq)

Many packages also ship a longer guide to what the package does, which you can open with `vignette()`

In [None]:
vignette('dplyr')

## Reserved Words

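Like in Python, some words are reserved and can't be used as object names (the last item below is not reserved, but you should avoid it just the same):

- if 
- else 
- while 
- function 
- for
- TRUE 
- FALSE 
- NULL 
- Inf 
- NaN 
- NA
- c (not technically reserved, but `c()` is the core function for combining values - especially this one, don't use it as a name!)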

A full list can be found [here](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Reserved.html)
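To see the difference between a reserved word and a name that is merely a bad idea, here is a minimal sketch. The helper `is_parse_error()` is made up for this example; it just checks whether R can parse a line of code.

```r
# Hypothetical helper: TRUE if R cannot even parse the code
is_parse_error <- function(code) {
  tryCatch({
    parse(text = code)
    FALSE
  }, error = function(e) TRUE)
}

# Assigning to a reserved word is a syntax error
reserved_fails <- is_parse_error("if <- 1")

# Assigning to c is allowed (which is exactly why it is dangerous)
c_fails <- is_parse_error("c <- 1")

c <- 5               # legal, but now c is also a number
combined <- c(1, 2)  # still works: R looks for a function when you call c()
rm(c)                # clean up so c only means the function again

print(reserved_fails)
print(c_fails)
```

R still finds the function here, but code that reuses `c` as a variable name is confusing to read, so it is best avoided.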

## Removing objects

As you know, we have a bunch of objects that are clogging up our RAM. If you want to remove an object the function is: 
`rm()`

In [21]:
rm(x,y)
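A quick way to confirm that an object is really gone is `exists()`. This sketch uses throwaway object names invented for the example.

```r
# Throwaway objects for illustration
scratch_a <- 1:3
scratch_b <- "temp"

# Remove both at once
rm(scratch_a, scratch_b)

# exists() takes the name as a string and reports whether it can be found
still_there <- exists("scratch_a")
print(still_there)

# rm(list = ls()) would remove everything in your workspace - use with care
```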

## Data Manipulation

I'll show the R equivalents of some of the commands we used in Python:
- Pick variables by their names (select())
    - select: Select (i.e. subset) columns by their names
- Dropping variables by their names
- Pick observations by their values (filter())
    - filter: Filter (i.e. subset) rows based on their values.
- Reorder the rows (arrange())
    - arrange: Arrange (i.e. reorder) rows based on their values.
- Create new variables with functions of existing variables (mutate())
    - mutate: Create new columns.
- Group by and summarise (group_by(), summarise())
    - group_by: Group rows by one or more key columns.
    - summarise: Collapse multiple rows into a single summary value.

We'll be using the package dplyr to help with data manipulation

In [25]:
# To get the names of the columns in the data frame (python = jobs.columns)
# names(df)
names(jobs)

Generally, you're going to tell R:
1. which dataframe you are manipulating
2. then the function you want to do

### Selecting Columns

In [26]:
#Select some of the columns
jobs %>% select(c('X_...code', 'prob', 'Average.annual.wage', 'education', 'numbEmployed'))

X_...code,prob,Average.annual.wage,education,numbEmployed
51-4033,0.9500,34920,High school diploma or equivalent,74600
51-9012,0.8800,41450,High school diploma or equivalent,47160
41-4012,0.8500,68410,High school diploma or equivalent,1404050
53-1031,0.0290,59800,High school diploma or equivalent,202760
51-4072,0.9500,32660,High school diploma or equivalent,145560
51-6091,0.8800,35420,High school diploma or equivalent,19340
51-4031,0.7800,34210,High school diploma or equivalent,192800
41-4011,0.2500,92910,Bachelor's degree,328370
51-4032,0.9400,38880,High school diploma or equivalent,12290
51-9041,0.9300,34370,High school diploma or equivalent,71260


In [27]:
# A simpler way to do this is:
select(jobs, c('X_...code', 'prob', 'Average.annual.wage', 'education', 'numbEmployed'))

X_...code,prob,Average.annual.wage,education,numbEmployed
51-4033,0.9500,34920,High school diploma or equivalent,74600
51-9012,0.8800,41450,High school diploma or equivalent,47160
41-4012,0.8500,68410,High school diploma or equivalent,1404050
53-1031,0.0290,59800,High school diploma or equivalent,202760
51-4072,0.9500,32660,High school diploma or equivalent,145560
51-6091,0.8800,35420,High school diploma or equivalent,19340
51-4031,0.7800,34210,High school diploma or equivalent,192800
41-4011,0.2500,92910,Bachelor's degree,328370
51-4032,0.9400,38880,High school diploma or equivalent,12290
51-9041,0.9300,34370,High school diploma or equivalent,71260


There are multiple ways of calling a dataframe and applying a function.

The first way is:
`df %>% select(c(col_names))`

So, this funny thing `%>%` (called a pipe) takes the dataframe named df on its left and passes it as the first argument to the function on its right, here select.


I find it much more intuitive to use:
`select(df, c(column_names))`

Here, I call the function select and tell it that the dataframe named df is what it should be applied to.

Because I have a preference, we'll stick to the latter form in the rest of the lecture - but when you look at Stack Overflow and get confused by the other notation, recall that it's the same-ish.

<img src = 'https://media.giphy.com/media/xT9Igj9Vh5mjLl6ZW0/giphy.gif' width = 300>
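If you want to convince yourself that the two notations really are the same, here is a small sketch. It assumes dplyr is installed, and the data frame `toy` is made up for the example.

```r
library(dplyr)

# Small made-up data frame for illustration
toy <- data.frame(
  code = c("a", "b", "c"),
  prob = c(0.9, 0.2, 0.5),
  wage = c(30, 90, 60)
)

# Pipe form and function-call form of the same selection
piped  <- toy %>% select(code, prob)
direct <- select(toy, code, prob)
same_result <- identical(piped, direct)
print(same_result)

# With several steps, the pipe reads left to right...
chained <- toy %>% filter(prob > 0.4) %>% select(code)
# ...while the function-call form nests from the inside out
nested <- select(filter(toy, prob > 0.4), code)
print(identical(chained, nested))
```

The pipe tends to pay off once you chain several steps; for a single step, either form is fine.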


In [28]:
# naming it a new dataframe (two ways to accomplish the same thing)
jobs_wages<-jobs %>% select(c('X_...code', 'prob', 'Average.annual.wage', 'education', 'numbEmployed'))
jobs_wages<- select(jobs, c('X_...code', 'prob', 'Average.annual.wage', 'education', 'numbEmployed'))

### Dropping columns

In [29]:
# put a minus sign before the column list to drop the columns you listed
select(jobs, -c('probability','X_...rank','employed_may2016' ,'average_ann_wage','len'))

X_...code,prob,Average.annual.wage,education,occupation,short.occupation,numbEmployed,median_ann_wage
51-4033,0.9500,34920,High school diploma or equivalent,"Grinding, Lapping, Polishing and Buffing Machine Tool Setters, Operators and Tenders, Metal and Plastic","Tool setters, operators and tenders",74600,32890
51-9012,0.8800,41450,High school diploma or equivalent,"Separating, Filtering, Clarifying, Precipitating and Still Machine Setters, Operators and Tenders","Tool setters, operators and tenders",47160,38360
41-4012,0.8500,68410,High school diploma or equivalent,"Sales Representatives, Wholesale and Manufacturing, Except Technical and Scientific Products","Sales Representatives, Wholesale and Manufacturing",1404050,57140
53-1031,0.0290,59800,High school diploma or equivalent,First-Line Supervisors of Transportation and Material-Moving Machine and Vehicle Operators,Supervisors Transportation,202760,57270
51-4072,0.9500,32660,High school diploma or equivalent,"Molding, Coremaking and Casting Machine Setters, Operators and Tenders, Metal and Plastic","Molding, Coremaking and Casting Machine Setters, Operators and Tenders, Metal and Plastic",145560,30480
51-6091,0.8800,35420,High school diploma or equivalent,"Extruding and Forming Machine Setters, Operators and Tenders, Synthetic and Glass Fibers","Extruding and Forming Machine Setters, Operators and Tenders, Synthetic and Glass Fibers",19340,34240
51-4031,0.7800,34210,High school diploma or equivalent,"Cutting, Punching and Press Machine Setters, Operators and Tenders, Metal and Plastic","Cutting, Punching and Press Machine Setters, Operators and Tenders, Metal and Plastic",192800,32370
41-4011,0.2500,92910,Bachelor's degree,"Sales Representatives, Wholesale and Manufacturing, Technical and Scientific Products","Sales Representatives, Wholesale and Manufacturing, Technical and Scientific Products",328370,78980
51-4032,0.9400,38880,High school diploma or equivalent,"Drilling and Boring Machine Tool Setters, Operators and Tenders, Metal and Plastic","Drilling and Boring Machine Tool Setters, Operators and Tenders, Metal and Plastic",12290,36410
51-9041,0.9300,34370,High school diploma or equivalent,"Extruding, Forming, Pressing and Compacting Machine Setters, Operators and Tenders","Extruding, Forming, Pressing and Compacting Machine Setters, Operators and Tenders",71260,32510


### Filtering

The comparison and boolean operators you know from pandas work much the same way in R: `&` is and, `|` is or, and `!` is not (where pandas uses `~`).

This handy chart can help you figure out what boolean operators you want to use:

<img src ='boolean.png' width = 500>
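The operators can be tried out on a tiny made-up data frame before using them on the real data. This sketch assumes dplyr is installed; `toy` and its values are invented for illustration.

```r
library(dplyr)

# Small made-up data frame for illustration
toy <- data.frame(
  job = c("clerk", "teacher", "welder", "nurse"),
  prob = c(0.95, 0.01, 0.80, 0.02),
  education = c("High school", "Bachelor's", "High school", "Bachelor's")
)

# AND: both conditions must hold
both <- filter(toy, prob > 0.5 & education == "High school")

# OR: at least one condition must hold
either <- filter(toy, prob > 0.9 | education == "Bachelor's")

# NOT: flip the condition
negated <- filter(toy, !(education == "High school"))

# %in% checks membership in a set, like .isin() in pandas
listed <- filter(toy, job %in% c("clerk", "nurse"))

print(both)
print(either)
print(negated)
print(listed)
```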

In [30]:
# python
# jobs[jobs['prob']>.8]
# R
filter(jobs, prob >.8)

X_...rank,X_...code,prob,Average.annual.wage,education,occupation,short.occupation,len,probability,numbEmployed,median_ann_wage,employed_may2016,average_ann_wage
624,51-4033,0.95,34920,High school diploma or equivalent,"Grinding, Lapping, Polishing and Buffing Machine Tool Setters, Operators and Tenders, Metal and Plastic","Tool setters, operators and tenders",35,0.95,74600,32890,74600,34920
517,51-9012,0.88,41450,High school diploma or equivalent,"Separating, Filtering, Clarifying, Precipitating and Still Machine Setters, Operators and Tenders","Tool setters, operators and tenders",35,0.88,47160,38360,47160,41450
484,41-4012,0.85,68410,High school diploma or equivalent,"Sales Representatives, Wholesale and Manufacturing, Except Technical and Scientific Products","Sales Representatives, Wholesale and Manufacturing",92,0.85,1404050,57140,1404050,68410
620,51-4072,0.95,32660,High school diploma or equivalent,"Molding, Coremaking and Casting Machine Setters, Operators and Tenders, Metal and Plastic","Molding, Coremaking and Casting Machine Setters, Operators and Tenders, Metal and Plastic",89,0.95,145560,30480,145560,32660
518,51-6091,0.88,35420,High school diploma or equivalent,"Extruding and Forming Machine Setters, Operators and Tenders, Synthetic and Glass Fibers","Extruding and Forming Machine Setters, Operators and Tenders, Synthetic and Glass Fibers",88,0.88,19340,34240,19340,35420
590,51-4032,0.94,38880,High school diploma or equivalent,"Drilling and Boring Machine Tool Setters, Operators and Tenders, Metal and Plastic","Drilling and Boring Machine Tool Setters, Operators and Tenders, Metal and Plastic",82,0.94,12290,36410,12290,38880
584,51-9041,0.93,34370,High school diploma or equivalent,"Extruding, Forming, Pressing and Compacting Machine Setters, Operators and Tenders","Extruding, Forming, Pressing and Compacting Machine Setters, Operators and Tenders",82,0.93,71260,32510,71260,34370
477,51-4034,0.84,39630,High school diploma or equivalent,"Lathe and Turning Machine Tool Setters, Operators and Tenders, Metal and Plastic","Lathe and Turning Machine Tool Setters, Operators and Tenders, Metal and Plastic",80,0.84,33850,38480,33850,39630
560,51-4021,0.91,35340,High school diploma or equivalent,"Extruding and Drawing Machine Setters, Operators and Tenders, Metal and Plastic","Extruding and Drawing Machine Setters, Operators and Tenders, Metal and Plastic",79,0.91,71960,33870,71960,35340
637,51-6064,0.96,28110,High school diploma or equivalent,"TextileWinding, Twisting and Drawing Out Machine Setters, Operators and Tenders","TextileWinding, Twisting and Drawing Out Machine Setters, Operators and Tenders",79,0.96,30340,27500,30340,28110


In [31]:
# python
# jobs[jobs['education']=='High school diploma or equivalent']
filter(jobs, education == 'High school diploma or equivalent')

X_...rank,X_...code,prob,Average.annual.wage,education,occupation,short.occupation,len,probability,numbEmployed,median_ann_wage,employed_may2016,average_ann_wage
624,51-4033,0.950,34920,High school diploma or equivalent,"Grinding, Lapping, Polishing and Buffing Machine Tool Setters, Operators and Tenders, Metal and Plastic","Tool setters, operators and tenders",35,0.950,74600,32890,74600,34920
517,51-9012,0.880,41450,High school diploma or equivalent,"Separating, Filtering, Clarifying, Precipitating and Still Machine Setters, Operators and Tenders","Tool setters, operators and tenders",35,0.880,47160,38360,47160,41450
484,41-4012,0.850,68410,High school diploma or equivalent,"Sales Representatives, Wholesale and Manufacturing, Except Technical and Scientific Products","Sales Representatives, Wholesale and Manufacturing",92,0.850,1404050,57140,1404050,68410
105,53-1031,0.029,59800,High school diploma or equivalent,First-Line Supervisors of Transportation and Material-Moving Machine and Vehicle Operators,Supervisors Transportation,26,0.029,202760,57270,202760,59800
620,51-4072,0.950,32660,High school diploma or equivalent,"Molding, Coremaking and Casting Machine Setters, Operators and Tenders, Metal and Plastic","Molding, Coremaking and Casting Machine Setters, Operators and Tenders, Metal and Plastic",89,0.950,145560,30480,145560,32660
518,51-6091,0.880,35420,High school diploma or equivalent,"Extruding and Forming Machine Setters, Operators and Tenders, Synthetic and Glass Fibers","Extruding and Forming Machine Setters, Operators and Tenders, Synthetic and Glass Fibers",88,0.880,19340,34240,19340,35420
427,51-4031,0.780,34210,High school diploma or equivalent,"Cutting, Punching and Press Machine Setters, Operators and Tenders, Metal and Plastic","Cutting, Punching and Press Machine Setters, Operators and Tenders, Metal and Plastic",85,0.780,192800,32370,192800,34210
590,51-4032,0.940,38880,High school diploma or equivalent,"Drilling and Boring Machine Tool Setters, Operators and Tenders, Metal and Plastic","Drilling and Boring Machine Tool Setters, Operators and Tenders, Metal and Plastic",82,0.940,12290,36410,12290,38880
584,51-9041,0.930,34370,High school diploma or equivalent,"Extruding, Forming, Pressing and Compacting Machine Setters, Operators and Tenders","Extruding, Forming, Pressing and Compacting Machine Setters, Operators and Tenders",82,0.930,71260,32510,71260,34370
477,51-4034,0.840,39630,High school diploma or equivalent,"Lathe and Turning Machine Tool Setters, Operators and Tenders, Metal and Plastic","Lathe and Turning Machine Tool Setters, Operators and Tenders, Metal and Plastic",80,0.840,33850,38480,33850,39630


In [32]:
# In pandas, we often used a tilde (~) to exclude something; in R you use an exclamation mark (!)
# python
# jobs[~((jobs['education'] == 'High school diploma or equivalent') | (jobs['education'] == 'No formal educational credential'))]

filter(jobs, !(education == 'High school diploma or equivalent' | education =='No formal educational credential'))


X_...rank,X_...code,prob,Average.annual.wage,education,occupation,short.occupation,len,probability,numbEmployed,median_ann_wage,employed_may2016,average_ann_wage
228,41-4011,0.2500,92910,Bachelor's degree,"Sales Representatives, Wholesale and Manufacturing, Technical and Scientific Products","Sales Representatives, Wholesale and Manufacturing, Technical and Scientific Products",85,0.2500,328370,78980,328370,92910
554,49-2093,0.9100,59840,Postsecondary nondegree award,"Electrical and Electronics Installers and Repairers, Transportation Equipment","Electrical and Electronics Installers and Repairers, Transportation Equipment",77,0.9100,13960,59280,13960,59840
208,15-1179,0.2100,67770,Associate's degree,"Information Security Analysts, Web Developers and Computer Network Architects","Information Security Analysts, Web Developers and Computer Network Architects",77,0.2100,188740,62670,188740,67770
254,49-2022,0.3600,54520,Postsecondary nondegree award,"Telecommunications Equipment Installers and Repairers, Except Line Installers",Telecommunications Equipment Installers and Repairers,77,0.3600,228430,53640,228430,54520
103,17-2111,0.0280,90190,Bachelor's degree,"Health and Safety Engineers, Except Mining Safety Engineers and Inspectors",Health and Safety Engineers,74,0.0280,25410,86720,25410,90190
205,25-3011,0.1900,55140,Bachelor's degree,Adult Basic and Secondary Education and Literacy Teachers and Instructors,Adult Basic and Secondary Education and Literacy Teachers and Instructors,73,0.1900,58810,50650,58810,55140
277,49-2094,0.4100,56990,Postsecondary nondegree award,"Electrical and Electronics Repairers, Commercial and Industrial Equipment","Electrical and Electronics Repairers, Commercial and Industrial Equipment",73,0.4100,67390,56250,67390,56990
41,25-2031,0.0078,61420,Bachelor's degree,"Secondary School Teachers, Except Special and Career/Technical Education",Secondary School Teachers,72,0.0078,1003250,58030,1003250,61420
261,49-2095,0.3800,74540,Postsecondary nondegree award,"Electrical and Electronics Repairers, Powerhouse, Substation and Relay","Electrical and Electronics Repairers, Powerhouse, Substation and Relay",70,0.3800,23060,75670,23060,74540
200,25-2022,0.1700,59800,Bachelor's degree,"Middle School Teachers, Except Special and Career/Technical Education",Middle School Teachers,69,0.1700,626310,56720,626310,59800


In [33]:
# and our good friend, is.na()
# in python:   jobs[jobs['prob'].isnull()]

filter(jobs, is.na(prob))

X_...rank,X_...code,prob,Average.annual.wage,education,occupation,short.occupation,len,probability,numbEmployed,median_ann_wage,employed_may2016,average_ann_wage


In [34]:
# and to drop na items
# python: jobs.dropna(subset=['prob'])

filter(jobs, !is.na(prob))

X_...rank,X_...code,prob,Average.annual.wage,education,occupation,short.occupation,len,probability,numbEmployed,median_ann_wage,employed_may2016,average_ann_wage
624,51-4033,0.9500,34920,High school diploma or equivalent,"Grinding, Lapping, Polishing and Buffing Machine Tool Setters, Operators and Tenders, Metal and Plastic","Tool setters, operators and tenders",35,0.9500,74600,32890,74600,34920
517,51-9012,0.8800,41450,High school diploma or equivalent,"Separating, Filtering, Clarifying, Precipitating and Still Machine Setters, Operators and Tenders","Tool setters, operators and tenders",35,0.8800,47160,38360,47160,41450
484,41-4012,0.8500,68410,High school diploma or equivalent,"Sales Representatives, Wholesale and Manufacturing, Except Technical and Scientific Products","Sales Representatives, Wholesale and Manufacturing",92,0.8500,1404050,57140,1404050,68410
105,53-1031,0.0290,59800,High school diploma or equivalent,First-Line Supervisors of Transportation and Material-Moving Machine and Vehicle Operators,Supervisors Transportation,26,0.0290,202760,57270,202760,59800
620,51-4072,0.9500,32660,High school diploma or equivalent,"Molding, Coremaking and Casting Machine Setters, Operators and Tenders, Metal and Plastic","Molding, Coremaking and Casting Machine Setters, Operators and Tenders, Metal and Plastic",89,0.9500,145560,30480,145560,32660
518,51-6091,0.8800,35420,High school diploma or equivalent,"Extruding and Forming Machine Setters, Operators and Tenders, Synthetic and Glass Fibers","Extruding and Forming Machine Setters, Operators and Tenders, Synthetic and Glass Fibers",88,0.8800,19340,34240,19340,35420
427,51-4031,0.7800,34210,High school diploma or equivalent,"Cutting, Punching and Press Machine Setters, Operators and Tenders, Metal and Plastic","Cutting, Punching and Press Machine Setters, Operators and Tenders, Metal and Plastic",85,0.7800,192800,32370,192800,34210
228,41-4011,0.2500,92910,Bachelor's degree,"Sales Representatives, Wholesale and Manufacturing, Technical and Scientific Products","Sales Representatives, Wholesale and Manufacturing, Technical and Scientific Products",85,0.2500,328370,78980,328370,92910
590,51-4032,0.9400,38880,High school diploma or equivalent,"Drilling and Boring Machine Tool Setters, Operators and Tenders, Metal and Plastic","Drilling and Boring Machine Tool Setters, Operators and Tenders, Metal and Plastic",82,0.9400,12290,36410,12290,38880
584,51-9041,0.9300,34370,High school diploma or equivalent,"Extruding, Forming, Pressing and Compacting Machine Setters, Operators and Tenders","Extruding, Forming, Pressing and Compacting Machine Setters, Operators and Tenders",82,0.9300,71260,32510,71260,34370


In [35]:
# You can also rename columns while selecting them
# python equivalent: jobs[['X_...code', 'X_...rank']].rename(columns={'X_...code': 'code', 'X_...rank': 'rank'})

select(jobs, code=X_...code , rank= X_...rank)

code,rank
51-4033,624
51-9012,517
41-4012,484
53-1031,105
51-4072,620
51-6091,518
51-4031,427
41-4011,228
51-4032,590
51-9041,584


In [36]:
# to rename just some columns, but keep the whole dataframe
# python equivalent: jobs.rename(columns={'X_...code': 'code', 'X_...rank': 'rank'})

rename(jobs, code=X_...code , rank= X_...rank)

rank,code,prob,Average.annual.wage,education,occupation,short.occupation,len,probability,numbEmployed,median_ann_wage,employed_may2016,average_ann_wage
624,51-4033,0.9500,34920,High school diploma or equivalent,"Grinding, Lapping, Polishing and Buffing Machine Tool Setters, Operators and Tenders, Metal and Plastic","Tool setters, operators and tenders",35,0.9500,74600,32890,74600,34920
517,51-9012,0.8800,41450,High school diploma or equivalent,"Separating, Filtering, Clarifying, Precipitating and Still Machine Setters, Operators and Tenders","Tool setters, operators and tenders",35,0.8800,47160,38360,47160,41450
484,41-4012,0.8500,68410,High school diploma or equivalent,"Sales Representatives, Wholesale and Manufacturing, Except Technical and Scientific Products","Sales Representatives, Wholesale and Manufacturing",92,0.8500,1404050,57140,1404050,68410
105,53-1031,0.0290,59800,High school diploma or equivalent,First-Line Supervisors of Transportation and Material-Moving Machine and Vehicle Operators,Supervisors Transportation,26,0.0290,202760,57270,202760,59800
620,51-4072,0.9500,32660,High school diploma or equivalent,"Molding, Coremaking and Casting Machine Setters, Operators and Tenders, Metal and Plastic","Molding, Coremaking and Casting Machine Setters, Operators and Tenders, Metal and Plastic",89,0.9500,145560,30480,145560,32660
518,51-6091,0.8800,35420,High school diploma or equivalent,"Extruding and Forming Machine Setters, Operators and Tenders, Synthetic and Glass Fibers","Extruding and Forming Machine Setters, Operators and Tenders, Synthetic and Glass Fibers",88,0.8800,19340,34240,19340,35420
427,51-4031,0.7800,34210,High school diploma or equivalent,"Cutting, Punching and Press Machine Setters, Operators and Tenders, Metal and Plastic","Cutting, Punching and Press Machine Setters, Operators and Tenders, Metal and Plastic",85,0.7800,192800,32370,192800,34210
228,41-4011,0.2500,92910,Bachelor's degree,"Sales Representatives, Wholesale and Manufacturing, Technical and Scientific Products","Sales Representatives, Wholesale and Manufacturing, Technical and Scientific Products",85,0.2500,328370,78980,328370,92910
590,51-4032,0.9400,38880,High school diploma or equivalent,"Drilling and Boring Machine Tool Setters, Operators and Tenders, Metal and Plastic","Drilling and Boring Machine Tool Setters, Operators and Tenders, Metal and Plastic",82,0.9400,12290,36410,12290,38880
584,51-9041,0.9300,34370,High school diploma or equivalent,"Extruding, Forming, Pressing and Compacting Machine Setters, Operators and Tenders","Extruding, Forming, Pressing and Compacting Machine Setters, Operators and Tenders",82,0.9300,71260,32510,71260,34370


In [37]:
# you can select columns whose names contain a given string
# this selects only the columns with an X in their name
select(jobs, contains("X"))



X_...rank,X_...code
624,51-4033
517,51-9012
484,41-4012
105,53-1031
620,51-4072
518,51-6091
427,51-4031
228,41-4011
590,51-4032
584,51-9041


In [38]:
# don't forget to reassign the result if you want the change to stick
jobs <-rename(jobs, code=X_...code , rank= X_...rank)
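The reason for the reassignment is that dplyr functions return a new data frame and leave the original alone. A minimal sketch, assuming dplyr is installed and using a made-up data frame `toy`:

```r
library(dplyr)

# Small made-up data frame for illustration
toy <- data.frame(old_code = c("a", "b"), prob = c(0.1, 0.9))

# rename() returns a renamed copy; toy itself is untouched
renamed_copy <- rename(toy, code = old_code)
names_before <- names(toy)
print(names_before)   # still has old_code

# Reassign to keep the change
toy <- rename(toy, code = old_code)
names_after <- names(toy)
print(names_after)
```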

## Ordering/Sorting

We can sort the rows of a dataframe with the function `arrange()`

It takes a data frame and a set of column names (or more complicated expressions) to order by. 

In [39]:
# Here, we sort by probability first; if there is a tie, education level breaks the tie
#python jobs.sort_values(['prob', 'education'])

arrange(jobs, prob,education)

rank,code,prob,Average.annual.wage,education,occupation,short.occupation,len,probability,numbEmployed,median_ann_wage,employed_may2016,average_ann_wage
1,29-1125,0.0028,48190,Bachelor's degree,Recreational Therapists,Recreational Therapists,23,0.0028,18100,46410,18100,48190
3,11-9161,0.0030,78060,Bachelor's degree,Emergency Management Directors,Emergency Management Directors,30,0.0030,9570,70500,9570,78060
2,49-1011,0.0030,66730,High school diploma or equivalent,"First-Line Supervisors of Mechanics, Installers and Repairers","First-Line Supervisors of Mechanics, Installers and Repairers",61,0.0030,453330,63540,453330,66730
4,21-1023,0.0031,47880,Bachelor's degree,Mental Health and Substance Abuse Social Workers,Mental Health and Substance Abuse Social Workers,48,0.0031,114040,42700,114040,47880
5,29-1181,0.0033,79290,Doctoral or professional degree,Audiologists,Audiologists,12,0.0033,12310,75980,12310,79290
7,29-2091,0.0035,69920,Master's degree,Orthotists and Prosthetists,Orthotists and Prosthetists,27,0.0035,7500,65630,7500,69920
8,21-1022,0.0035,55510,Master's degree,Healthcare Social Workers,Healthcare Social Workers,25,0.0035,159310,53760,159310,55510
6,29-1122,0.0035,83730,Master's degree,Occupational Therapists,Occupational Therapists,23,0.0035,118070,81910,118070,83730
9,29-1022,0.0036,232870,Doctoral or professional degree,Oral and Maxillofacial Surgeons,Oral and Maxillofacial Surgeons,31,0.0036,5380,232870,5380,232870
10,33-1021,0.0036,77050,Postsecondary nondegree award,First-Line Supervisors of Fire Fighting and Prevention Workers,First-Line Supervisors of Fire Fighting and Prevention Workers,62,0.0036,57170,74540,57170,77050


In [40]:
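# desc() reverses the order, so the highest probabilities come first
# python: jobs.sort_values('prob', ascending=False)
arrange(jobs, desc(prob))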

rank,code,prob,Average.annual.wage,education,occupation,short.occupation,len,probability,numbEmployed,median_ann_wage,employed_may2016,average_ann_wage
694,51-9151,0.99,31740,High school diploma or equivalent,Photographic Process Workers and Processing Machine Operators,Photographic Process Workers and Processing Machine Operators,61,0.99,26430,26470,26430,31740
701,23-2093,0.99,51490,High school diploma or equivalent,"Title Examiners, Abstractors and Searchers","Title Examiners, Abstractors and Searchers",42,0.99,54560,45800,54560,51490
696,43-5011,0.99,44250,High school diploma or equivalent,Cargo and Freight Agents,Cargo and Freight Agents,24,0.99,88920,41920,88920,44250
699,15-2091,0.99,58490,Bachelor's degree,Mathematical Technicians,Mathematical Technicians,24,0.99,510,49660,510,58490
698,13-2053,0.99,75480,Bachelor's degree,Insurance Underwriters,Insurance Underwriters,22,0.99,91650,67680,91650,75480
692,25-4031,0.99,34780,Postsecondary nondegree award,Library Technicians,Library Technicians,19,0.99,93410,32890,93410,34780
693,43-4141,0.99,36480,High school diploma or equivalent,New Accounts Clerks,New Accounts Clerks,19,0.99,41630,34990,41630,36480
691,43-9021,0.99,31640,High school diploma or equivalent,Data Entry Keyers,Data Entry Keyers,17,0.99,194810,30100,194810,31640
697,49-9064,0.99,39720,High school diploma or equivalent,Watch Repairers,Watch Repairers,15,0.99,1620,36740,1620,39720
695,13-2082,0.99,45340,High school diploma or equivalent,Tax Preparers,Tax Preparers,13,0.99,70030,36550,70030,45340
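
You can also mix ascending and descending keys in one call. Here is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical, not taken from the jobs file), assuming dplyr is installed:

```r
library(dplyr)

# Hypothetical toy data, not taken from the jobs file
toy <- data.frame(
  occupation = c("A", "B", "C", "D"),
  prob       = c(0.99, 0.43, 0.99, 0.43),
  wage       = c(30000, 90000, 45000, 60000),
  stringsAsFactors = FALSE
)

# Highest probability first; within a tie, lowest wage first
# python: toy.sort_values(['prob', 'wage'], ascending=[False, True])
sorted <- arrange(toy, desc(prob), wage)

sorted$occupation   # "A" "C" "D" "B"
```

Only the column wrapped in `desc()` is reversed; `wage` still sorts from smallest to largest inside each tie.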


## Mutate

You may want to add new columns that are functions of existing columns. The function for that is `mutate()`.

By default, `mutate()` adds new columns at the end of your dataset.

You can create a whole variety of new variables, as in Python. Here are some useful tips on this:

**Arithmetic operators**: +, -, *, /, ^. These are all vectorised, using the so-called “recycling rules”. If one vector is shorter than the other, it will be automatically extended to be the same length. 
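
To make the recycling rules concrete, here is a small base R sketch. The variable names are made up for illustration; the first three wage values are the ones shown in the jobs output.

```r
# Element-wise arithmetic on two vectors of equal length
avg_wage <- c(34920, 41450, 68410)
med_wage <- c(32890, 38360, 57140)
wage_gap <- avg_wage - med_wage      # 2030 3090 11270

# Recycling: the single value 1000 is reused for every element
wage_k <- avg_wage / 1000            # 34.92 41.45 68.41

# Recycling a length-2 vector across a length-4 vector
recycled <- c(1, 2, 3, 4) * c(10, 100)   # 10 200 30 400
```

One caveat, as far as I know: inside `mutate()` dplyr is stricter than base R and only recycles values of length 1, so the last line would raise an error there rather than recycle silently.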

**Modular arithmetic**: %/% (integer division) and %% (remainder), where x == y * (x %/% y) + (x %% y). Modular arithmetic is a handy tool because it allows you to break integers up into pieces. 
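
As a quick sketch of "breaking integers into pieces", suppose a time is stored as a single number like 1745 (meaning 17:45). The `time` vector is a made-up example, not a column in the jobs data.

```r
# Times written as HHMM, e.g. 1745 means 17:45
time <- c(517, 1745, 2359)

hour   <- time %/% 100   # integer division: 5 17 23
minute <- time %% 100    # remainder:       17 45 59

# The identity x == y * (x %/% y) + (x %% y) holds for every element
identity_holds <- all(time == 100 * hour + minute)   # TRUE
```

In Python the same two operators are written `//` and `%`.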

**Logs**: log(), log2(), log10(). Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude. 

**Offsets**: lead() and lag() allow you to refer to leading or lagging values. This allows you to compute running differences (e.g. x - lag(x)) or find when values change (x != lag(x)). This is useful for regressions with time series.
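
Here is a minimal sketch of both offsets on a made-up vector `x`, assuming dplyr is loaded (its `lag()` masks the base R function of the same name, which behaves differently):

```r
library(dplyr)

x <- c(10, 12, 12, 15)

prev_x <- lag(x)    # NA 10 12 12  (value one position earlier)
next_x <- lead(x)   # 12 12 15 NA  (value one position later)

change  <- x - lag(x)    # NA  2  0  3         (running difference)
changed <- x != lag(x)   # NA TRUE FALSE TRUE  (did the value change?)
```

The first element has nothing before it, so `lag()` fills it with `NA`, and that `NA` carries through to anything computed from it.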

**Logical comparisons**: <, <=, >, >=, !=, and ==. If you’re doing a complex sequence of logical operations it’s often a good idea to store the interim values in new variables so you can check that each step is working as expected.

**Ranking**: there are a number of ranking functions, but you should start with min_rank(). It does the most usual type of ranking (e.g. 1st, 2nd, 2nd, 4th). The default gives smallest values the small ranks; use desc(x) to give the largest values the smallest ranks.
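
A short sketch of how `min_rank()` handles ties, on a made-up vector of probabilities and assuming dplyr is loaded:

```r
library(dplyr)

prob <- c(0.95, 0.88, 0.88, 0.03)

rank_asc  <- min_rank(prob)         # 4 2 2 1  (smallest value gets rank 1)
rank_desc <- min_rank(desc(prob))   # 1 2 2 4  (largest value gets rank 1)
```

The two tied values share rank 2, and the next rank used is 4, not 3. That is the "1st, 2nd, 2nd, 4th" pattern.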


In [41]:
# here, we can see that the new variable diff is added on at the end
mutate(jobs, 
      diff = Average.annual.wage - median_ann_wage)

rank,code,prob,Average.annual.wage,education,occupation,short.occupation,len,probability,numbEmployed,median_ann_wage,employed_may2016,average_ann_wage,diff
624,51-4033,0.9500,34920,High school diploma or equivalent,"Grinding, Lapping, Polishing and Buffing Machine Tool Setters, Operators and Tenders, Metal and Plastic","Tool setters, operators and tenders",35,0.9500,74600,32890,74600,34920,2030
517,51-9012,0.8800,41450,High school diploma or equivalent,"Separating, Filtering, Clarifying, Precipitating and Still Machine Setters, Operators and Tenders","Tool setters, operators and tenders",35,0.8800,47160,38360,47160,41450,3090
484,41-4012,0.8500,68410,High school diploma or equivalent,"Sales Representatives, Wholesale and Manufacturing, Except Technical and Scientific Products","Sales Representatives, Wholesale and Manufacturing",92,0.8500,1404050,57140,1404050,68410,11270
105,53-1031,0.0290,59800,High school diploma or equivalent,First-Line Supervisors of Transportation and Material-Moving Machine and Vehicle Operators,Supervisors Transportation,26,0.0290,202760,57270,202760,59800,2530
620,51-4072,0.9500,32660,High school diploma or equivalent,"Molding, Coremaking and Casting Machine Setters, Operators and Tenders, Metal and Plastic","Molding, Coremaking and Casting Machine Setters, Operators and Tenders, Metal and Plastic",89,0.9500,145560,30480,145560,32660,2180
518,51-6091,0.8800,35420,High school diploma or equivalent,"Extruding and Forming Machine Setters, Operators and Tenders, Synthetic and Glass Fibers","Extruding and Forming Machine Setters, Operators and Tenders, Synthetic and Glass Fibers",88,0.8800,19340,34240,19340,35420,1180
427,51-4031,0.7800,34210,High school diploma or equivalent,"Cutting, Punching and Press Machine Setters, Operators and Tenders, Metal and Plastic","Cutting, Punching and Press Machine Setters, Operators and Tenders, Metal and Plastic",85,0.7800,192800,32370,192800,34210,1840
228,41-4011,0.2500,92910,Bachelor's degree,"Sales Representatives, Wholesale and Manufacturing, Technical and Scientific Products","Sales Representatives, Wholesale and Manufacturing, Technical and Scientific Products",85,0.2500,328370,78980,328370,92910,13930
590,51-4032,0.9400,38880,High school diploma or equivalent,"Drilling and Boring Machine Tool Setters, Operators and Tenders, Metal and Plastic","Drilling and Boring Machine Tool Setters, Operators and Tenders, Metal and Plastic",82,0.9400,12290,36410,12290,38880,2470
584,51-9041,0.9300,34370,High school diploma or equivalent,"Extruding, Forming, Pressing and Compacting Machine Setters, Operators and Tenders","Extruding, Forming, Pressing and Compacting Machine Setters, Operators and Tenders",82,0.9300,71260,32510,71260,34370,1860



If you only want to keep the new variables, use transmute():

In [42]:
transmute(jobs, 
      diff = Average.annual.wage - median_ann_wage)

diff
2030
3090
11270
2530
2180
1180
1840
13930
2470
1860


In [43]:
# You can combine mutate with boolean filters
jobs %>% filter(occupation %in% c('Economists')) %>% mutate(
      diff = Average.annual.wage - median_ann_wage)

rank,code,prob,Average.annual.wage,education,occupation,short.occupation,len,probability,numbEmployed,median_ann_wage,employed_may2016,average_ann_wage,diff
282,19-3011,0.43,112860,Master's degree,Economists,Economists,10,0.43,19380,101050,19380,112860,11810


## Groupby and Summarize

We can get simple summary statistics of our dataframe, like we did in pandas with `describe()` (but a bit more complicated).

In R, the function uses the British spelling, `summarise()`.

*Technically, `summarize()` with a z works too.

In [44]:
# This is the mean probability of the entire dataset, excluding any values that are NA
summarise(jobs, prob = mean(prob, na.rm=TRUE))

prob
0.5355499


We can use `summarise()` in conjunction with `group_by()`, which is the same process as `groupby` in pandas. `group_by()` splits the data into the groups you need, and `summarise()` then applies a statistic to each of those groups.

In [45]:
# We can create multiple summary columns in one summarise() call on the grouped data
by_educ <- group_by(jobs, education)
summarise(by_educ, av_pat = mean(Average.annual.wage, na.rm=TRUE),
         count = n())

education,av_pat,count
Associate's degree,56492.5,44
Bachelor's degree,80601.74,155
Doctoral or professional degree,126743.04,23
High school diploma or equivalent,44011.5,307
Master's degree,75965.86,29
No formal educational credential,33030.83,98
Postsecondary nondegree award,48555.48,42
"Some college, no degree",44515.9,4

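The same split-then-summarise steps can also be chained with the pipe. This is only a sketch on made-up data (`toy`, `av_wage` and the values are hypothetical), assuming dplyr is loaded:

```r
library(dplyr)

# Hypothetical toy data, not taken from the jobs file
toy <- data.frame(
  education = c("Bachelor's", "Bachelor's", "Master's", "Master's", "Master's"),
  wage      = c(50000, 70000, 60000, NA, 80000),
  stringsAsFactors = FALSE
)

# python: toy.groupby('education').agg(av_wage=('wage', 'mean'), count=('wage', 'size'))
by_educ <- toy %>%
  group_by(education) %>%
  summarise(av_wage = mean(wage, na.rm = TRUE),   # NA is dropped before averaging
            count   = n())                        # n() counts every row in the group

by_educ$av_wage   # 60000 70000
by_educ$count     # 2 3
```

Notice that `n()` still counts the row with the missing wage, while `mean(..., na.rm = TRUE)` ignores it. Keep that in mind when a count and an average sit side by side.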

# Breakout Group Exercises

I want you to start getting familiar with R and deal with any troubleshooting issues that you might have. 
1. Set your working directory to where the jobs data is located
2. Import the data "job-automation-probability.csv"
3. Select the data columns short.occupation, education, prob, average_ann_wage, X_...code
4. Calculate the minimum probability by 'education'
5. Create a new variable that calculates the difference between the minimum and maximum probability values by education, and make sure the result from item 4 is in the same dataframe.

<img src ='https://media.giphy.com/media/13HgwGsXF0aiGY/giphy.gif' width = 300>

## Next Week
### What is "tidy" data?

Resources:
- [Vignette](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html) (from the **tidyr** package)
- [Original paper](https://vita.had.co.nz/papers/tidy-data.pdf) (Hadley Wickham, 2014 JSS) <- the same author as the R for Data Science book I mentioned earlier