In [1]:
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
import json
%matplotlib inline

pd.options.display.max_columns = 50
pd.options.display.max_colwidth = 300

## Predicting "Greenness" Of Content

This dataset comes from [StumbleUpon](https://www.stumbleupon.com/), a web page recommender, and was made available [here](https://www.kaggle.com/c/stumbleupon/download/train.tsv). You will need to download the data into this folder.

A description of the columns is below:

FieldName|Type|Description
---------|----|-----------
url|string|Url of the webpage to be classified
urlid|integer| StumbleUpon's unique identifier for each url
boilerplate|json|Boilerplate text
alchemy_category|string|Alchemy category (per the publicly available Alchemy API found at www.alchemyapi.com)
alchemy_category_score|double|Alchemy category score (per the publicly available Alchemy API found at www.alchemyapi.com)
avglinksize| double|Average number of words in each link
commonLinkRatio_1|double|# of links sharing at least 1 word with 1 other links / # of links
commonLinkRatio_2|double|# of links sharing at least 1 word with 2 other links / # of links
commonLinkRatio_3|double|# of links sharing at least 1 word with 3 other links / # of links
commonLinkRatio_4|double|# of links sharing at least 1 word with 4 other links / # of links
compression_ratio|double|Compression achieved on this page via gzip (measure of redundancy)
embed_ratio|double|Count of number of <embed> usage
frameBased|integer (0 or 1)|A page is frame-based (1) if it has no body markup but has a frameset markup
frameTagRatio|double|Ratio of iframe markups over total number of markups
hasDomainLink|integer (0 or 1)|True (1) if it contains an <a> with an url with domain
html_ratio|double|Ratio of tags vs text in the page
image_ratio|double|Ratio of <img> tags vs text in the page
is_news|integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is news
lengthyLinkDomain| integer (0 or 1)|True (1) if at least 3 <a> 's text contains more than 30 alphanumeric characters
linkwordscore|double|Percentage of words on the page that are in hyperlink's text
news_front_page| integer (0 or 1)|True (1) if StumbleUpon's news classifier determines that this webpage is front-page news
non_markup_alphanum_characters|integer| Page's text's number of alphanumeric characters
numberOfLinks|integer|Number of <a> markups
numwords_in_url| double|Number of words in url
parametrizedLinkRatio|double|A link is parametrized if its url contains parameters or has an attached onClick event
spelling_errors_ratio|double|Ratio of words not found in wiki (considered to be a spelling mistake)
label|integer (0 or 1)|User-determined label. Either evergreen (1) or non-evergreen (0); available for train.tsv only

### What are 'evergreen' sites?
- These are websites that are always relevant, like recipes or reviews (as opposed to current events)
- Look at some examples

In [2]:
data = pd.read_csv('~/Workspace/Data/GA-Labs/stumble_data/train.tsv', sep='\t', na_values={'is_news' : '?'}).fillna(0)

# Extract the title and body from the boilerplate JSON text
data['title'] = data.boilerplate.map(lambda x: json.loads(x).get('title', ''))
data['body'] = data.boilerplate.map(lambda x: json.loads(x).get('body', ''))

In [3]:
data[['title', 'label']].head()

Unnamed: 0,title,label
0,"IBM Sees Holographic Calls Air Breathing Batteries ibm sees holographic calls, air-breathing batteries",0
1,"The Fully Electronic Futuristic Starting Gun That Eliminates Advantages in Races the fully electronic, futuristic starting gun that eliminates advantages in races the fully electronic, futuristic starting gun that eliminates advantages in races",1
2,Fruits that Fight the Flu fruits that fight the flu | cold & flu | men's health,1
3,10 Foolproof Tips for Better Sleep,1
4,The 50 Coolest Jerseys You Didn t Know Existed coolest jerseys you haven't seen,0


#### Does being a news site affect green-ness?

In [None]:
news_y = data[data.is_news == 1].label.sum()
news_n = data[data.is_news == 0].label.sum()
float(news_y)/float(news_n)
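
The ratio above compares raw counts of green pages, so it also reflects how many news and non-news pages there are. A minimal sketch of comparing the green *rate* within each group instead; the `toy` frame is made-up data standing in for `data`, and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data standing in for `data`; the values are made up.
toy = pd.DataFrame({
    "is_news": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "label":   [1, 1, 0, 0, 1, 1, 1, 0, 0, 0],
})

# For a 0/1 label, the group mean is the share of green pages in that group.
rates = toy.groupby("is_news")["label"].agg(["sum", "count", "mean"])
print(rates)

# The ratio of raw green counts, as computed in the cell above.
count_ratio = rates.loc[1, "sum"] / rates.loc[0, "sum"]
print(count_ratio)
```

In this toy example the count ratio is below 1 even though both groups have the same green rate, which is why the per-group mean is the safer comparison.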

In [10]:
import statsmodels.formula.api as sm

df = data

model = sm.logit(
    "label ~ is_news",
    data = df
).fit()

model.summary()

Optimization terminated successfully.
         Current function value: 0.692751
         Iterations 3


0,1,2,3
Dep. Variable:,label,No. Observations:,7395.0
Model:,Logit,Df Residuals:,7393.0
Method:,MLE,Df Model:,1.0
Date:,"Tue, 25 Jul 2017",Pseudo R-squ.:,5.98e-05
Time:,18:18:26,Log-Likelihood:,-5122.9
converged:,True,LL-Null:,-5123.2
,,LLR p-value:,0.4337

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,0.0303,0.038,0.806,0.420,-0.043,0.104
is_news,0.0374,0.048,0.783,0.434,-0.056,0.131


News pages are slightly more likely to be green (the coefficient is positive), but the effect is not statistically significant: the p-value is high (0.434).
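
To see how small the effect is, the log-odds coefficients from the summary table can be converted to probabilities. This is a sketch that copies the rounded values from the table; `logistic` is a helper name introduced here for illustration.

```python
import math

def logistic(x):
    """Inverse of the logit link: maps log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Coefficients copied (rounded) from the summary table above.
intercept = 0.0303
is_news_coef = 0.0374

p_not_news = logistic(intercept)                # predicted P(green) when is_news == 0
p_news = logistic(intercept + is_news_coef)     # predicted P(green) when is_news == 1
odds_ratio = math.exp(is_news_coef)             # multiplicative change in the odds

print(round(p_not_news, 3), round(p_news, 3), round(odds_ratio, 3))
```

The two predicted probabilities come out at roughly 51% and 52%, which matches the near-zero pseudo R-squared.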

#### Does the website category affect green-ness?

In [None]:
print("green count")
print(data.groupby(data.alchemy_category)['label'].sum())
print("\n")
print("total count")
print(data.groupby(data.alchemy_category)['label'].count())

In [11]:
df = data

model = sm.logit(
    "label ~ alchemy_category",
    data = df
).fit()

model.summary()

         Current function value: 0.649452
         Iterations: 35




0,1,2,3
Dep. Variable:,label,No. Observations:,7395.0
Model:,Logit,Df Residuals:,7381.0
Method:,MLE,Df Model:,13.0
Date:,"Tue, 25 Jul 2017",Pseudo R-squ.:,0.06256
Time:,18:18:57,Log-Likelihood:,-4802.7
converged:,False,LL-Null:,-5123.2
,,LLR p-value:,1.372e-128

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,0.0085,0.041,0.207,0.836,-0.072,0.090
alchemy_category[T.arts_entertainment],-0.5324,0.079,-6.731,0.000,-0.687,-0.377
alchemy_category[T.business],0.8935,0.085,10.499,0.000,0.727,1.060
alchemy_category[T.computer_internet],-1.1253,0.141,-7.979,0.000,-1.402,-0.849
alchemy_category[T.culture_politics],-0.1780,0.116,-1.535,0.125,-0.405,0.049
alchemy_category[T.gaming],-0.5475,0.241,-2.269,0.023,-1.021,-0.074
alchemy_category[T.health],0.2861,0.099,2.892,0.004,0.092,0.480
alchemy_category[T.law_crime],-0.3340,0.366,-0.912,0.362,-1.052,0.384
alchemy_category[T.recreation],0.7650,0.074,10.340,0.000,0.620,0.910


Our model did not converge, so it's hard to say for sure how good this fit is. Looking at the p-values, though, several categories appear to have a statistically significant effect relative to the baseline category: arts_entertainment, business, computer_internet, and sports, to name a few.
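
One common reason a categorical logit fails to converge is a category with very few rows, or one whose rows are all 0 or all 1. Whether that is the cause here is an assumption; the output above does not say. A minimal sketch of folding rare categories into a single level before refitting, where `collapse_rare`, the `"other"` label, and the toy column are all hypothetical:

```python
import pandas as pd

def collapse_rare(categories, min_count):
    """Replace categories seen fewer than `min_count` times with 'other'."""
    counts = categories.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; replace the rest with 'other'.
    return categories.where(~categories.isin(rare), "other")

# Hypothetical toy column standing in for data.alchemy_category.
toy = pd.Series(["business"] * 5 + ["recreation"] * 4 + ["weather"] + ["unknown"])
collapsed = collapse_rare(toy, min_count=3)
print(collapsed.value_counts())
```

The collapsed column could then be used in the formula in place of `alchemy_category`. Raising the iteration cap with `fit(maxiter=...)` is another option, although it will not help if a category is perfectly separated.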

#### Does the image ratio affect green-ness?

In [None]:
green = data[data.label == 1]
not_green = data[data.label == 0]
print(green.image_ratio.describe())
print("\n")
print(not_green.image_ratio.describe())

In [12]:

df = data

model = sm.logit(
    "label ~ image_ratio",
    data = df
).fit()

model.summary()

Optimization terminated successfully.
         Current function value: 0.692631
         Iterations 5


0,1,2,3
Dep. Variable:,label,No. Observations:,7395.0
Model:,Logit,Df Residuals:,7393.0
Method:,MLE,Df Model:,1.0
Date:,"Tue, 25 Jul 2017",Pseudo R-squ.:,0.0002325
Time:,18:19:55,Log-Likelihood:,-5122.0
converged:,True,LL-Null:,-5123.2
,,LLR p-value:,0.1228

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,0.0590,0.024,2.499,0.012,0.013,0.105
image_ratio,-0.0210,0.015,-1.400,0.161,-0.051,0.008


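There is not a clear effect: the image_ratio coefficient is slightly negative, but its p-value (0.161) is too high to call it statistically significant.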
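
The `z` and `P>|z|` columns in the summary can be reproduced by hand from the coefficient and its standard error. A sketch using the rounded values from the image_ratio row; `two_sided_p_from_z` is a helper name introduced here.

```python
import math

def two_sided_p_from_z(z):
    """Two-sided p-value for a standard normal z statistic."""
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    return 2.0 * (1.0 - cdf)

# image_ratio row of the summary table above (rounded values).
coef, std_err = -0.0210, 0.015

z = coef / std_err            # Wald z statistic
p = two_sided_p_from_z(z)
print(round(z, 2), round(p, 3))
```

The result lands close to the reported p-value; small differences come from using the rounded table entries.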

#### Fit a logistic regression model using statsmodels
- Test different features that may be valuable
- Examine the coefficients: does the feature increase or decrease the odds of being evergreen?

In [8]:
df = data

model = sm.logit(
    "label ~ image_ratio + is_news",
    data = df
).fit()

model.summary()

Optimization terminated successfully.
         Current function value: 0.692598
         Iterations 5


0,1,2,3
Dep. Variable:,label,No. Observations:,7395.0
Model:,Logit,Df Residuals:,7392.0
Method:,MLE,Df Model:,2.0
Date:,"Tue, 25 Jul 2017",Pseudo R-squ.:,0.0002808
Time:,18:18:20,Log-Likelihood:,-5121.8
converged:,True,LL-Null:,-5123.2
,,LLR p-value:,0.2373

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,0.0381,0.038,1.005,0.315,-0.036,0.112
image_ratio,-0.0205,0.015,-1.369,0.171,-0.050,0.009
is_news,0.0337,0.048,0.704,0.482,-0.060,0.128
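
The `LLR p-value` in the summary tests whether the model as a whole beats the intercept-only model. A rough sketch of that calculation from the two log-likelihoods shown above; because those are rounded to one decimal, the result only approximates the reported 0.2373.

```python
import math

# Log-likelihoods copied from the summary above (rounded to one decimal there).
ll_model = -5121.8
ll_null = -5123.2

# Likelihood-ratio statistic: twice the gain in log-likelihood over the null model.
llr = 2.0 * (ll_model - ll_null)

# With exactly 2 degrees of freedom (Df Model = 2), the chi-square survival
# function has the closed form exp(-x / 2), so no extra library is needed.
p_value = math.exp(-llr / 2.0)
print(round(llr, 2), round(p_value, 3))
```

Either way the p-value is far above 0.05, so adding both features does not give a detectably better fit than the intercept alone.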


#### Fit a logistic regression model using statsmodels with text features
- Add text features that may be useful to the model and see whether they improve the fit
- Examine the coefficients: does the feature increase or decrease the odds of being evergreen?

In [12]:
# EXAMPLE text feature 'recipe'
data['is_recipe'] = data['title'].fillna('').str.contains('recipe')
data['is_health'] = data['title'].fillna('').str.contains('health')
data['is_sports'] = data['title'].fillna('').str.contains('sport')
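
`str.contains` is case-sensitive by default, so a title such as "Recipe Ideas" would not match `'recipe'`. A small sketch of a case-insensitive variant that also yields 0/1 integers; the `titles` series is made-up data standing in for `data['title']`.

```python
import pandas as pd

# Hypothetical titles standing in for data['title']; None mimics a missing title.
titles = pd.Series(["Easy Recipe for Bread", "recipe: soup", "Sports roundup", None])

# case=False ignores capitalisation; na=False treats missing titles as no match.
is_recipe = titles.str.contains("recipe", case=False, na=False).astype(int)
is_sports = titles.str.contains("sport", case=False, na=False).astype(int)
print(is_recipe.tolist(), is_sports.tolist())
```

With integer columns the formula interface treats each flag as a plain numeric term, so the coefficient would likely be labelled `is_recipe` rather than `is_recipe[T.True]`.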

In [13]:
model = sm.logit(
    "label ~ is_sports + is_recipe",
    data = df
).fit()

model.summary()

Optimization terminated successfully.
         Current function value: 0.671803
         Iterations 6


0,1,2,3
Dep. Variable:,label,No. Observations:,7395.0
Model:,Logit,Df Residuals:,7392.0
Method:,MLE,Df Model:,2.0
Date:,"Tue, 25 Jul 2017",Pseudo R-squ.:,0.0303
Time:,18:08:18,Log-Likelihood:,-4968.0
converged:,True,LL-Null:,-5123.2
,,LLR p-value:,3.906e-68

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,0.0055,0.024,0.229,0.819,-0.042,0.053
is_sports[T.True],-1.4903,0.207,-7.197,0.000,-1.896,-1.084
is_recipe[T.True],2.1095,0.173,12.165,0.000,1.770,2.449


It appears that titles containing "sport" are negatively associated with being green (coefficient -1.49), while titles containing "recipe" have an even stronger positive association (coefficient 2.11).

In [79]:
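import math
# Exponentiate the fitted coefficients to turn log-odds into odds ratios
print(round(math.exp(2.1095), 2))   # is_recipe
print(round(math.exp(-1.4903), 3))  # is_sports

8.24
0.225


So a title containing "recipe" multiplies the odds of being green by about 8. Because the intercept is close to 0 (baseline odds of roughly 1), such a page is about 8 times as likely to be green as not.

Sports titles multiply the odds by only about 0.225; in other words, they are more likely to not be green.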
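
The coefficients in the summary table are log-odds, and so are the bounds of their 95% confidence intervals. Exponentiating all three gives an odds ratio with an interval around it. A sketch using the rounded table values; the `rows` dictionary is just a convenient container introduced here.

```python
import math

# (coef, ci_low, ci_high) copied from the summary table above.
rows = {
    "is_sports": (-1.4903, -1.896, -1.084),
    "is_recipe": (2.1095, 1.770, 2.449),
}

# Exponentiating log-odds coefficients and their bounds gives odds ratios.
odds_ratios = {
    name: tuple(round(math.exp(v), 2) for v in values)
    for name, values in rows.items()
}
for name, (ratio, low, high) in odds_ratios.items():
    print(name, ratio, low, high)
```

Neither interval contains 1, which is the odds-ratio equivalent of the coefficient's interval not containing 0.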

The coefficients are statistically significant, with very low p-values, but our pseudo R-squared is very small (0.0303), so these two features explain little of the overall variation in the label.
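
The `Pseudo R-squ.` that statsmodels reports for a logit is McFadden's pseudo R-squared, which can be recomputed from the log-likelihoods in each summary. A sketch comparing the models fitted in this notebook, using the rounded values shown above; `mcfadden_r2` is a helper name introduced here.

```python
def mcfadden_r2(ll_model, ll_null):
    """McFadden's pseudo R-squared: 1 - LL_model / LL_null."""
    return 1.0 - ll_model / ll_null

# LL-Null is the same for every model because the response is the same.
ll_null = -5123.2

# Log-likelihoods copied from the summaries above.
log_likelihoods = {
    "image_ratio + is_news": -5121.8,
    "alchemy_category": -4802.7,
    "is_sports + is_recipe": -4968.0,
}
for formula, ll in log_likelihoods.items():
    print(formula, round(mcfadden_r2(ll, ll_null), 4))
```

On this measure the two text flags do far better than `image_ratio + is_news`, but still less well than the (non-converged) category model.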