### Being a data scientist, the real sh*t:

* Data scientists in real life have multiple goals that fit under the
  general category of "making sense of data" or "turning data into
  insights".
  * Often we are focused on building a predictive model, and on
    maximizing the predictive model's accuracy. Sometimes this focus
    is overemphasized. This is a thing that we do, but this is not the
    only thing.
    * Things that fail are often still interesting insights.
    * Anecdotes are often interesting insights.
    * The utility of an analysis is independent of the sophistication
      of the algorithm. Sometimes the most mind-blowing insights come
      from lists, tables, histograms, or scatter plots. Don't throw
      out cool stuff that isn't technically advanced unless absolutely
      necessary.
      * Don't over-design and under-deliver. For every data science
        project that you see or hear about, the version in the data
        scientist's head was probably fancier, bigger, more
        comprehensive, more elegant, presented in a cooler format, or
        with better copy, et cetera, ad nauseam. The reason you heard
        about it at all, however, is because it was *finished*, and
        published or released in all its heart-wrenching
        imperfection.
        * Start with something small, and build from there, as
          necessary, as time allows.
        * Jot down the elaborations, next steps, uh-ohs, or grand
          ideas that strike you as you are working. Leave them alone
          for a while and then come back and look at them later.
          * Many things that feel like huge "uh-ohs" in the heat of the
            moment are actually small deals or even false alarms. The
            fewer of these you spend time on, the better.




In [1]:
import requests
import re
import time
import os
import pandas as pd
from bs4 import BeautifulSoup
from IPython.display import display, HTML
from selenium import webdriver
from selenium.webdriver.common.keys import Keys


#import diagnostic_plots
import patsy
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
# sklearn.cross_validation was removed from scikit-learn; model_selection replaces it
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
%matplotlib inline



In [2]:
# school starting page:  http://www.ratemyprofessors.com/search.jsp?query=&queryoption=HEADER&stateselect=&country=&dept=&queryBy=teacherName&facetSearch=&schoolName=University+of+California+Berkeley&offset=0&max=20
# example professor: http://www.ratemyprofessors.com/ShowRatings.jsp?tid=7503

# Each school's listing spans multiple pages. school_url_flex is used to generate the link for each page.
# A _flex suffix means the link is a template that takes a parameter for website navigation.

school_url="http://www.ratemyprofessors.com/search.jsp?query=&queryoption=HEADER&stateselect=&country=&dept=&queryBy=teacherName&facetSearch=&schoolName=University+of+Florida&offset=0&max=20"
school_url_flex="http://www.ratemyprofessors.com/search.jsp?query=&queryoption=HEADER&stateselect=&country=&dept=&queryBy=teacherName&facetSearch=&schoolName=University+of+Florida&offset={}&max=20"
prof_url="http://www.ratemyprofessors.com/ShowRatings.jsp?tid=144"
prof_url_flex="http://www.ratemyprofessors.com{}"

prof_response=requests.get(prof_url)
school_response=requests.get(school_url)

print(prof_response.status_code)
print(school_response.status_code)



200
200


In [92]:
prof_page=prof_response.text
school_page=school_response.text

prof_soup = BeautifulSoup(prof_page,"lxml")
school_soup = BeautifulSoup(school_page,"lxml")

In [93]:
# Total number of professor listings in a school. Given the school soup, return the number of professors.

def total_professors(school_soup):
    for e in school_soup.find_all(class_="toppager"):     #(class_="toppager-left"):
        temp=e.find(class_="result-count").text
        result=re.findall(r'\d+', temp)
        # The page usually shows records 1-20 out of x results; x is the largest of the three numbers.
        # Compare as integers: max() on the raw strings would compare them lexicographically.
        return max(int(n) for n in result)

total_prof=total_professors(school_soup)
total_prof

5307
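
A note on taking the maximum of regex matches: `re.findall` returns strings, and `max` on strings compares them character by character rather than by value. The sketch below shows the difference on a made-up result-count string. The `sample` text and the helper name `largest_number` are assumptions for illustration, not the site's actual markup.

```python
import re

def largest_number(text):
    """Return the largest integer found in text, or 0 if there are no digits."""
    numbers = [int(n) for n in re.findall(r"\d+", text)]
    return max(numbers) if numbers else 0

# Hypothetical result-count text; the real page wording may differ.
sample = "Showing 1-20 of 100 results"

print(largest_number(sample))            # numeric comparison
print(max(re.findall(r"\d+", sample)))   # string comparison picks '20'
```

Numbers written with thousands separators (for example "5,307") would be split into two matches, so this helper assumes plain digits.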

In [94]:
def page_of_listing(total_professors):
    # Ceiling division: 20 listings per page, with no extra page when the total is a multiple of 20
    pages=-(-total_professors//20)
    return pages

test=page_of_listing(total_prof)
test

266
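
The page count is the ceiling of the total divided by the page size. A minimal sketch with integer arithmetic follows. The name `pages_needed` and the `page_size` parameter are illustrative; the page size of 20 comes from the `max=20` parameter in the URLs above.

```python
def pages_needed(total_items, page_size=20):
    """Return how many pages are needed to list total_items entries."""
    if total_items < 0 or page_size <= 0:
        raise ValueError("total_items must be >= 0 and page_size must be > 0")
    # Adding page_size - 1 before floor division rounds up without using floats.
    return (total_items + page_size - 1) // page_size

print(pages_needed(5307))  # 266
print(pages_needed(40))    # 2
```

Unlike `total // 20 + 1`, this does not request an extra empty page when the total is an exact multiple of the page size.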

In [95]:
# Given a school's flexible URL and the number of listing pages, generate a list of URLs for all listing pages
def page_urls(url_flex, pages):
    list_urls=[]
    for i in range (0,pages):
        offset=i*20
        page_url=url_flex.format(offset)
        list_urls.append(page_url)
    return list_urls

test_urls=page_urls(school_url_flex, 2)
test_urls



['http://www.ratemyprofessors.com/search.jsp?query=&queryoption=HEADER&stateselect=&country=&dept=&queryBy=teacherName&facetSearch=&schoolName=University+of+Florida&offset=0&max=20',
 'http://www.ratemyprofessors.com/search.jsp?query=&queryoption=HEADER&stateselect=&country=&dept=&queryBy=teacherName&facetSearch=&schoolName=University+of+Florida&offset=20&max=20']
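
An alternative to formatting the offset into a long template string is to build the query string from a dictionary with `urllib.parse.urlencode`. This is only a sketch: `BASE_SEARCH_URL` and `build_page_urls` are hypothetical names, and dropping the empty query parameters is an assumption that has not been checked against the site.

```python
from urllib.parse import urlencode

BASE_SEARCH_URL = "http://www.ratemyprofessors.com/search.jsp"

def build_page_urls(school_name, pages, page_size=20):
    """Build one search URL per listing page for the given school."""
    urls = []
    for page in range(pages):
        params = {
            "queryBy": "teacherName",
            "schoolName": school_name,
            "offset": page * page_size,  # first record shown on this page
            "max": page_size,            # records per page
        }
        # urlencode escapes spaces as '+', matching the URLs used above
        urls.append(BASE_SEARCH_URL + "?" + urlencode(params))
    return urls

for url in build_page_urls("University of Florida", 2):
    print(url)
```

Keeping the parameters in a dictionary makes it easier to switch schools without editing a long string by hand.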

In [96]:
# Generate links for professors. Given a list of page links, return a list of links to professor pages


def prof_urls(page_links, url_flex):
    url_listing=[]
    for link in page_links:
        temp_response=requests.get(link)
        temp_page=temp_response.text
        temp_soup = BeautifulSoup(temp_page,"lxml")
        for e in temp_soup.find_all('li',class_="listing PROFESSOR"):  
            temp= e.find('a')['href']
            prof_url=url_flex.format(temp)  # use the url_flex argument rather than the global prof_url_flex
            url_listing.append(prof_url)
    return url_listing
                    
    
test_prof_urls=prof_urls(test_urls, prof_url_flex)
test_prof_urls


['http://www.ratemyprofessors.com/ShowRatings.jsp?tid=144',
 'http://www.ratemyprofessors.com/ShowRatings.jsp?tid=8086',
 'http://www.ratemyprofessors.com/ShowRatings.jsp?tid=8090',
 'http://www.ratemyprofessors.com/ShowRatings.jsp?tid=8154',
 'http://www.ratemyprofessors.com/ShowRatings.jsp?tid=8155',
 'http://www.ratemyprofessors.com/ShowRatings.jsp?tid=10897',
 'http://www.ratemyprofessors.com/ShowRatings.jsp?tid=16038',
 'http://www.ratemyprofessors.com/ShowRatings.jsp?tid=16519',
 'http://www.ratemyprofessors.com/ShowRatings.jsp?tid=16520',
 'http://www.ratemyprofessors.com/ShowRatings.jsp?tid=16521',
 'http://www.ratemyprofessors.com/ShowRatings.jsp?tid=16522',
 'http://www.ratemyprofessors.com/ShowRatings.jsp?tid=16529',
 'http://www.ratemyprofessors.com/ShowRatings.jsp?tid=17669',
 'http://www.ratemyprofessors.com/ShowRatings.jsp?tid=23963',
 'http://www.ratemyprofessors.com/ShowRatings.jsp?tid=23964',
 'http://www.ratemyprofessors.com/ShowRatings.jsp?tid=23971',
 'http://www.r

In [97]:
# Grab the professor's name. Given the professor soup, return a string

def get_name(prof_soup):
    name=""
    if prof_soup.find_all(class_="profname"):
        for e in prof_soup.find_all(class_="profname"):
            name=e.find(class_="pfname").text.strip()+" "+e.find(class_="plname").text.strip()
    return name

get_name (prof_soup)

'John Griffith'

In [98]:
# Grab the professor's rating. Given the soup, return a float

def get_rating(prof_soup):
    temp=""
    if prof_soup.find(class_="breakdown-container quality"):
        temp=prof_soup.find(class_="breakdown-container quality")
    if temp: 
        s=temp.find(class_='grade')
        return float(s.text)
    else: 
        return (0)
            
            
get_rating(prof_soup)
        
# can be list of strings

4.3

In [99]:
# Grab the professor's level of difficulty. Given the soup, return a float

def get_level_of_difficulty(prof_soup):
    
    if prof_soup.find(class_="breakdown-section difficulty"):
    
        level_of_difficulty=prof_soup.find(class_="breakdown-section difficulty").stripped_strings

        return float(list (level_of_difficulty)[1])
    else:
        return (2.5) #return average difficulty level if no value

get_level_of_difficulty(prof_soup)
    


3.4

In [100]:
# Grab total number reviews

def get_number_reviews(prof_soup):
    
    if prof_soup.find('div',class_="table-toggle rating-count active"):
        text=prof_soup.find('div',class_="table-toggle rating-count active").text.strip()
        num_students=int(re.findall(r'\d+', text, re.I)[0])
        return num_students
    else:
        return (0)

get_number_reviews(prof_soup)

5

In [101]:
# get all tags of professor
tags_url="http://www.ratemyprofessors.com/AddRating.jsp?tid=9670"
tags_page=requests.get(tags_url).text
tags_soup = BeautifulSoup(tags_page,"lxml")

all_tags=[]
for e in tags_soup.find_all('div', class_="scrollable"):  #entire tag
    for f in e.find_all('a',class_=''): #each tag was embed in tag-box-choosetags class
        all_tags.append (f.text.strip().capitalize())
all_tags.sort()
all_tags


['Accessible outside class',
 'Amazing lectures',
 'Beware of pop quizzes',
 'Caring',
 'Clear grading criteria',
 'Extra credit',
 'Get ready to read',
 'Gives good feedback',
 'Graded by few things',
 'Group projects',
 'Hilarious',
 'Inspirational',
 'Lecture heavy',
 'Lots of homework',
 'Participation matters',
 'Respected',
 "Skip class? you won't pass.",
 'So many papers',
 'Test heavy',
 'Tough grader']

In [102]:
'''
all_tags=['Tough Grader',
 'Gives good feedback',
 'Respected',
 'Get ready to read',
 'Participation matters',
 "Skip class? You won't pass.",
 'LOTS OF HOMEWORK',
 'Inspirational',
 'BEWARE OF POP QUIZZES',
 'ACCESSIBLE OUTSIDE CLASS',
 'SO MANY PAPERS',
 'Clear grading criteria',
 'Hilarious',
 'TEST HEAVY',
 'GRADED BY FEW THINGS',
 'Amazing lectures',
 'Caring',
 'EXTRA CREDIT',
 'GROUP PROJECTS',
 'LECTURE HEAVY']
'''
# Grab the main tags of the professor. Given soup, return tags


def get_tags(prof_soup):
    list_of_tags=[]
    dic={}
    total_count=0
    for e in prof_soup.find_all(class_="tag-box"):  # entire tag box
        for f in e.find_all(class_='tag-box-choosetags'): # each tag is embedded in the tag-box-choosetags class
            list_of_tags.append (f.text.strip())
    
    # sort the tags so the resulting dictionary has a stable order
    list_of_tags.sort()
    
    # split the text and record the count for each tag
    for i in list_of_tags:
        category=re.findall(r'[^\(]*', i)[0].strip().capitalize()
        # \d+ captures the whole count; a bare \d would keep only its first digit
        count=int(re.findall(r'\d+', i)[0])
        dic[category]=count
        total_count+=count
    
    # normalize the count for each tag
    for key in dic:
        dic[key]=round(dic[key]/total_count,2)
           
    return dic
    
    
    
have=get_tags(prof_soup)
have


{}
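
The professor used for testing has no tags, so the output above is empty. To show what the parsing and normalization step does on non-empty input, here is a standalone sketch that works on plain strings instead of a soup. It assumes each tag reads like `Label (count)`, which is what the regular expressions above imply; `TAG_PATTERN`, `normalize_tags`, and the sample strings are made up for illustration.

```python
import re

# Label, optional whitespace, then a count in parentheses at the end of the string
TAG_PATTERN = re.compile(r"^(?P<label>.*?)\s*\((?P<count>\d+)\)\s*$")

def normalize_tags(raw_tags):
    """Turn strings like 'Tough Grader (12)' into {label: share of all tag counts}."""
    counts = {}
    for raw in raw_tags:
        match = TAG_PATTERN.match(raw.strip())
        if match is None:
            continue  # skip anything that does not look like 'Label (count)'
        label = match.group("label").strip().capitalize()
        counts[label] = counts.get(label, 0) + int(match.group("count"))
    total = sum(counts.values())
    if total == 0:
        return {}  # avoid dividing by zero when there are no tags
    return {label: round(count / total, 2) for label, count in counts.items()}

sample = ["Tough Grader (12)", "CARING (3)", "Hilarious (5)"]  # invented values
print(normalize_tags(sample))
```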

In [103]:
# Count and normalize all tags a professor has in reviews and return a list of counts
# all_tags is a list representing the entire listing of tags. have_tags is a dictionary with the normalized count of each tag

def tag_count(all_tags, have_tags):
    dic={}
    lis=[]
    for key in all_tags:
        dic[key]=0
    
    for key in have_tags:
        if key in all_tags:
            dic[key]=have_tags[key]
    
    for key in all_tags:
        lis.append(dic[key])
    
    return lis

tag_count(all_tags, have)
          

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
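
The same alignment of a tag dictionary to a fixed column order can be written more compactly with `dict.get`. This is a sketch of an equivalent approach under the hypothetical name `tag_vector`, with invented sample values.

```python
def tag_vector(all_tags, have_tags):
    """Return one value per tag in all_tags, using 0 for tags the professor lacks."""
    # Keys of have_tags that are not in all_tags are ignored, as in tag_count above.
    return [have_tags.get(tag, 0) for tag in all_tags]

columns = ["Caring", "Hilarious", "Tough grader"]
print(tag_vector(columns, {"Tough grader": 0.6, "Unknown tag": 0.4}))  # [0, 0, 0.6]
```

Because the result follows the order of `all_tags`, every professor's row lines up with the same columns.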

In [114]:

total_prof=total_professors(school_soup)
pages=page_of_listing(total_prof)

result_urls=page_urls(school_url_flex, pages)

print ("Total professors: "+str(total_prof), "Pages: "+str(pages))

def scrape_data(urls):
    data=[]
    temp_page_links=prof_urls(urls, prof_url_flex)
    for index, link2 in enumerate(temp_page_links):
        # added statement to skip error as needed
        try:         
            temp=[]
            temp_response=requests.get(link2)
            temp_page=temp_response.text
            temp_soup = BeautifulSoup(temp_page,"lxml")
            temp_name=get_name (temp_soup)
            temp_rating=get_rating(temp_soup)
            temp_difficulty=get_level_of_difficulty(temp_soup)
            temp_numbers_reviews=get_number_reviews(temp_soup)
            have_tags=get_tags(temp_soup) 
            temp_tags=tag_count(all_tags, have_tags)

            temp.extend([temp_name, temp_rating, temp_difficulty,temp_numbers_reviews])
            temp.extend(temp_tags)

            print(str(index)+", ",end="") # count the instance. print on same line
            data.append(temp)
            time.sleep(1)
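        except Exception as err:
            # skip this professor, but report the failure instead of hiding it
            print("\nSkipped "+link2+": "+repr(err))
            continue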
    return data


                    
    

Total professors: 5307 Pages: 266
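
When a long scrape skips failing pages, it helps to keep a record of what was skipped so those pages can be retried later. Below is a minimal sketch of that pattern. It takes the page-fetching function as an argument so it can be tried without network access; `scrape_all`, `fetch_row`, and `fake_fetch` are hypothetical names, not part of the notebook's code.

```python
import time

def scrape_all(urls, fetch_row, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch_row on each URL, collecting rows and a list of failures."""
    rows, failures = [], []
    for url in urls:
        try:
            rows.append(fetch_row(url))
        except Exception as err:  # keep going, but remember what failed and why
            failures.append((url, repr(err)))
        sleep(delay_seconds)      # pause between requests to be polite to the server
    return rows, failures

def fake_fetch(url):
    """Stand-in for a real page scraper; fails on URLs ending in 'bad'."""
    if url.endswith("bad"):
        raise ValueError("could not parse " + url)
    return [url, 4.3]

rows, failures = scrape_all(["a", "bad", "c"], fake_fetch, sleep=lambda seconds: None)
print(rows)
print(failures)
```

Catching `Exception` rather than using a bare `except` lets Ctrl-C still interrupt a long run.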


In [115]:
column_names=['Name', 'Rating', 'Level of difficulty','Total reviews']
column_names.extend(all_tags)
column_names

['Name',
 'Rating',
 'Level of difficulty',
 'Total reviews',
 'Accessible outside class',
 'Amazing lectures',
 'Beware of pop quizzes',
 'Caring',
 'Clear grading criteria',
 'Extra credit',
 'Get ready to read',
 'Gives good feedback',
 'Graded by few things',
 'Group projects',
 'Hilarious',
 'Inspirational',
 'Lecture heavy',
 'Lots of homework',
 'Participation matters',
 'Respected',
 "Skip class? you won't pass.",
 'So many papers',
 'Test heavy',
 'Tough grader']

In [4]:
#save to pickle for efficiency

import pickle

filename = '/Users/xzhou/github/project_files/project_luther/professor_data_uf.pkl' #5307 records

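try:
    with open(filename,'rb') as pklfile:
        df = pickle.load(pklfile)
except FileNotFoundError:
    # no cached file yet: scrape the data, then cache it
    result=scrape_data(result_urls)
    df=pd.DataFrame(result, columns=column_names)
    with open(filename,'wb') as pklfile:
        pickle.dump(df, pklfile)  # dump returns None, so its result must not be assigned to df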

In [4]:
with open(filename,'rb') as pklfile:
    df = pickle.load(pklfile)
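
The load-or-scrape pattern used here can be wrapped in a small helper. A minimal sketch, assuming the cache is a single pickle file: `load_or_build` and `expensive_build` are illustrative names, and the demo writes to a temporary directory rather than the project path. Note that `pickle.dump` returns `None`, so the helper returns the built value itself.

```python
import os
import pickle
import tempfile

def load_or_build(path, build):
    """Load a pickled object from path, or build it, cache it, and return it."""
    try:
        with open(path, "rb") as handle:
            return pickle.load(handle)
    except FileNotFoundError:
        value = build()
        with open(path, "wb") as handle:
            pickle.dump(value, handle)
        return value

calls = []

def expensive_build():
    """Stand-in for a slow step such as scraping; records each call."""
    calls.append(1)
    return {"rows": [1, 2, 3]}

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "demo.pkl")
    first = load_or_build(cache_path, expensive_build)   # builds and caches
    second = load_or_build(cache_path, expensive_build)  # loads from the cache

print(first == second, len(calls))
```

Only unpickle files you created yourself, since loading a pickle can run arbitrary code.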

In [6]:
df.head()

Unnamed: 0,Name,Rating,Level of difficulty,Total reviews,Accessible outside class,Amazing lectures,Beware of pop quizzes,Caring,Clear grading criteria,Extra credit,...,Hilarious,Inspirational,Lecture heavy,Lots of homework,Participation matters,Respected,Skip class? you won't pass.,So many papers,Test heavy,Tough grader
0,John Griffith,4.3,3.4,5,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Micheal Moulton,4.2,1.2,218,0.04,0.0,0.0,0.11,0.32,0.18,...,0.11,0.04,0.0,0.0,0.0,0.04,0.0,0.0,0.04,0.0
2,Liz Seiberling,2.3,3.6,15,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,David Groisser,2.5,4.6,26,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.5
4,Alex Turull,4.7,2.4,19,0.0,0.0,0.43,0.0,0.0,0.0,...,0.0,0.0,0.0,0.14,0.0,0.0,0.29,0.0,0.0,0.14


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5306 entries, 0 to 5305
Data columns (total 24 columns):
Name                           5306 non-null object
Rating                         5306 non-null float64
Level of difficulty            5306 non-null float64
Total reviews                  5306 non-null int64
Accessible outside class       5306 non-null float64
Amazing lectures               5306 non-null float64
Beware of pop quizzes          5306 non-null float64
Caring                         5306 non-null float64
Clear grading criteria         5306 non-null float64
Extra credit                   5306 non-null float64
Get ready to read              5306 non-null float64
Gives good feedback            5306 non-null float64
Graded by few things           5306 non-null float64
Group projects                 5306 non-null float64
Hilarious                      5306 non-null float64
Inspirational                  5306 non-null float64
Lecture heavy                  5306 non-null flo

In [122]:
df.shape

(5306, 24)

In [9]:
# Capture additional features for rating prediction

filename_add_features = '/Users/xzhou/github/project_files/project_luther/professor_data_uf_add_features.pkl' #3986 records

try:
    with open(filename_add_features,'rb') as pklfile:
        df2 = pickle.load(pklfile)
except FileNotFoundError:
    with open(filename,'rb') as pklfile:
        df2 = pickle.load(pklfile)

    # school-level features, constant for every University of Florida row
    df2.insert(4, 'Region_south', 1)
    df2.insert(4, 'Region_east', 0)
    df2.insert(4, 'Region_west', 0)
    df2.insert(4, 'Type_private', 0)
    df2.insert(4, 'Type_public', 1)
    df2.insert(4, 'Student size', 52367)

    with open(filename_add_features,'wb') as pklfile:
        pickle.dump(df2, pklfile)  # dump returns None, so its result must not be assigned to df2

In [10]:
with open(filename_add_features,'rb') as pklfile:
    df2 = pickle.load(pklfile)

df2.head()

Unnamed: 0,Name,Rating,Level of difficulty,Total reviews,Student size,Type_public,Type_private,Region_west,Region_east,Region_south,...,Hilarious,Inspirational,Lecture heavy,Lots of homework,Participation matters,Respected,Skip class? you won't pass.,So many papers,Test heavy,Tough grader
0,John Griffith,4.3,3.4,5,52367,1,0,0,0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Micheal Moulton,4.2,1.2,218,52367,1,0,0,0,1,...,0.11,0.04,0.0,0.0,0.0,0.04,0.0,0.0,0.04,0.0
2,Liz Seiberling,2.3,3.6,15,52367,1,0,0,0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,David Groisser,2.5,4.6,26,52367,1,0,0,0,1,...,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.5
4,Alex Turull,4.7,2.4,19,52367,1,0,0,0,1,...,0.0,0.0,0.0,0.14,0.0,0.0,0.29,0.0,0.0,0.14
