<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">

# Project 5


## NLP and Machine Learning on [travel.stackexchange.com](http://travel.stackexchange.com/) data

---

In Project 5 you'll be doing NLP and machine learning on post data from Stack Exchange's travel subdomain.

This project is set up like a mini Kaggle competition. You are given the training data, and when projects are submitted your model will be tested on the held-out testing data. There will be prizes for the people who build the models that perform best on the held-out test set!

---

## Notes on the data

The data is again compressed into the `.7z` file format to save space. There are 6 .csv files and one readme file that contains some information on the fields.

    posts_train.csv
    comments_train.csv
    users.csv
    badges.csv
    votes_train.csv
    tags.csv
    readme.txt
    
The data is located in your datasets folder:

    DSI-SF-2/datasets/stack_exchange_travel.7z
    
If you're interested in where this data came from and where to get more data from other stackexchange subdomains, see here:

https://ia800500.us.archive.org/22/items/stackexchange/readme.txt


### Recommended Utilities for .7z

- For OSX, use [Keka](http://www.kekaosx.com/en/) or [The Unarchiver](http://wakaba.c3.cx/s/apps/unarchiver.html).
- For Windows, [7-zip](http://www.7-zip.org/) is the standard.
- For Linux, try the `p7zip` utility: `sudo apt-get install p7zip`.



<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 1. Use LDA to find what topics are discussed on travel.stackexchange.com.

---

Text can be found in the posts and the comments datasets. The `ParentId` column in the posts dataset identifies the "question" post that a given post answers. Comment text can be merged onto the post it belongs to using the `PostId` field.

The text may have some HTML tags. BeautifulSoup has convenient ways to get rid of markup or extract text if you need to. You can also parse the strings yourself if you like.

The tags dataset has the "tags" that the users have officially given the post.

**1.1 Implement LDA against the text features of the dataset(s).**

- This can be posts or a combination of posts and comments if you want more power.
- Find optimal **K/num_topics**.

**1.2 Compare your topics to the tags. Do the LDA topics make sense? How do they compare to the tags?**


In [1]:
from gensim import corpora, models, matutils
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from collections import defaultdict
import pandas as pd
import seaborn as sns

In [150]:
comments = pd.read_csv('/Users/sashakapralov/Desktop/DSI-SF-5-Working/datasets/stack_exchange_travel/comments_train.csv')
posts = pd.read_csv('/Users/sashakapralov/Desktop/DSI-SF-5-Working/datasets/stack_exchange_travel/posts_train.csv')
tags = pd.read_csv('/Users/sashakapralov/Desktop/DSI-SF-5-Working/datasets/stack_exchange_travel/tags.csv')

In [151]:
tags.head()

Unnamed: 0,Count,ExcerptPostId,Id,TagName,WikiPostId
0,75,2138.0,1,cruising,2137.0
1,39,357.0,2,caribbean,356.0
2,31,319.0,4,vacations,318.0
3,6,14548.0,6,amazon-river,14547.0
4,74,1792.0,8,romania,1791.0


In [193]:
tags.shape

(1606, 5)

In [152]:
tags.TagName.unique().size

1606

In [194]:
tags.ExcerptPostId.unique().size

1391

In [195]:
tags.WikiPostId.unique().size

1391

In [153]:
(tags.ExcerptPostId - tags.WikiPostId).unique()

array([  1.,  nan])

In [154]:
tags.ExcerptPostId.isnull().sum()

216

In [155]:
tags.WikiPostId.isnull().sum()

216

In [156]:
comments.head()

Unnamed: 0,CreationDate,Id,PostId,Score,Text,UserDisplayName,UserId
0,2011-06-21T20:25:14.257,1,1,0,To help with the cruise line question: Where a...,,12.0
1,2011-06-21T20:27:35.300,2,1,0,"Toronto, Ontario. We can fly out of anywhere t...",,9.0
2,2011-06-21T20:32:23.687,3,1,3,"""Best"" for what? Please read [this page](http...",,20.0
3,2011-06-21T20:42:08.330,9,25,0,"Are you in the UK? If so, would be helpful to ...",,30.0
4,2011-06-21T20:44:09.990,12,26,3,"Where are you starting from, and what sort of ...",,26.0


In [157]:
comments.PostId.unique()

array([    1,    25,    26, ..., 71181, 71190, 71212])

In [158]:
comments.shape

(81506, 7)

In [159]:
posts.columns

Index([u'AcceptedAnswerId', u'AnswerCount', u'Body', u'ClosedDate',
       u'CommentCount', u'CommunityOwnedDate', u'CreationDate',
       u'FavoriteCount', u'Id', u'LastActivityDate', u'LastEditDate',
       u'LastEditorDisplayName', u'LastEditorUserId', u'OwnerDisplayName',
       u'OwnerUserId', u'ParentId', u'PostTypeId', u'Score', u'Tags', u'Title',
       u'ViewCount'],
      dtype='object')

In [160]:
posts.shape

(41289, 21)

In [161]:
posts.dtypes

AcceptedAnswerId         float64
AnswerCount              float64
Body                      object
ClosedDate                object
CommentCount               int64
CommunityOwnedDate        object
CreationDate              object
FavoriteCount            float64
Id                         int64
LastActivityDate          object
LastEditDate              object
LastEditorDisplayName     object
LastEditorUserId         float64
OwnerDisplayName          object
OwnerUserId              float64
ParentId                 float64
PostTypeId                 int64
Score                      int64
Tags                      object
Title                     object
ViewCount                float64
dtype: object

In [162]:
posts = posts[['Body','Id','ParentId','PostTypeId','Score','Tags','Title']]

In [163]:
posts.Body.head()

0    <p>My fiancée and I are looking for a good Car...
1    <p>Singapore Airlines has an all-business clas...
2    <p>Another definition question that interested...
3    <p>Can anyone suggest the best way to get from...
4    <p>We are considering visiting Argentina for u...
Name: Body, dtype: object

In [164]:
posts.Body[0]

"<p>My fianc\xc3\xa9e and I are looking for a good Caribbean cruise in October and were wondering which islands are best to see and which Cruise line to take?</p>\n\n<p>It seems like a lot of the cruises don't run in this month due to Hurricane season so I'm looking for other good options.</p>\n\n<p><strong>EDIT</strong> We'll be travelling in 2012.</p>\n"

In [165]:
posts.PostTypeId.unique()

array([1, 2, 5, 4, 7, 6])

In [166]:
posts.PostTypeId.value_counts()

2    23967
1    13988
5     1656
4     1656
6       18
7        4
Name: PostTypeId, dtype: int64

In [225]:
posts[posts.PostTypeId == 5]

Unnamed: 0,Body,PostId,ParentId,PostTypeId,Score,Tags,Title,Body_clean
91,"<h3><img src=""http://i.stack.imgur.com/Q6l0V.p...",157,,5,0,,,"[ United States of America (USA), A large coun..."
93,"<p><a href=""http://en.wikipedia.org/wiki/Antar...",163,,5,0,,,[Southernmost continent and mostly covered in ...
148,,242,,5,0,,,[nan]
156,,252,,5,0,,,[nan]
158,"<p><img src=""http://i.stack.imgur.com/p9b0Y.jp...",254,,5,0,,,"[, ]"
160,<p>Practice of creating pictures about some ev...,256,,5,0,,,[Practice of creating pictures about some even...
163,,259,,5,0,,,[nan]
182,,280,,5,0,,,[nan]
185,<p>Use this tag when you want to travel by tra...,283,,5,0,,,[Use this tag when you want to travel by train...
187,<p>For questions relating to the topic of Euro...,285,,5,0,,,[For questions relating to the topic of Europe...


In [167]:
posts.Id.unique()

array([    1,     4,     5, ..., 71208, 71209, 71212])

In [168]:
posts.rename(columns={'Id': 'PostId'}, inplace=True)

In [169]:
posts[posts.PostId == 71212].Body

41288    <p>When you enter one of the three micro-state...
Name: Body, dtype: object

In [170]:
comments[comments.PostId == 71212].Text

81500    I *assume* that the same logic applies to Disn...
81502    I am somewhat skeptical about that assumption....
81504    @AndrewLazarus, that would make a great questi...
Name: Text, dtype: object

In [171]:
comments.rename(columns={'Text': 'CommentText'}, inplace=True)

In [172]:
comments.columns

Index([u'CreationDate', u'Id', u'PostId', u'Score', u'CommentText',
       u'UserDisplayName', u'UserId'],
      dtype='object')

In [173]:
comments = comments[['PostId','CommentText']]

In [174]:
posts_comments = pd.merge(posts, comments, on='PostId')

In [175]:
posts_comments.shape

(81506, 8)
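
The inner merge above produces one row per comment (81,506 rows), so each post body is repeated once per comment, and posts without comments drop out. A minimal sketch of one alternative, written for Python 3 and using small made-up frames with the same column names: collapse the comments of each post into one string first, then left-merge. The `*_demo` names and the toy values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for the real frames (values are made up for illustration).
posts_demo = pd.DataFrame({
    'PostId': [1, 4, 5],
    'Body': ['Caribbean cruise?', 'Reward seats?', 'No comments here'],
})
comments_demo = pd.DataFrame({
    'PostId': [1, 1, 4],
    'CommentText': ['Where from?', 'Toronto.', 'Route cancelled.'],
})

# Collapse all comments of a post into a single string.
comments_per_post = (
    comments_demo.groupby('PostId')['CommentText']
    .apply(' '.join)
    .reset_index()
)

# A left merge keeps posts that have no comments at all.
merged_demo = posts_demo.merge(comments_per_post, on='PostId', how='left')
merged_demo['CommentText'] = merged_demo['CommentText'].fillna('')
merged_demo['AllText'] = (
    merged_demo['Body'] + ' ' + merged_demo['CommentText']
).str.strip()

print(merged_demo[['PostId', 'AllText']])
```

With this layout each post contributes exactly one document to the topic model, whether or not it has comments.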

In [176]:
posts_comments.head()

Unnamed: 0,Body,PostId,ParentId,PostTypeId,Score,Tags,Title,CommentText
0,<p>My fiancée and I are looking for a good Car...,1,,1,8,<caribbean><cruising><vacations>,What are some Caribbean cruises for October?,To help with the cruise line question: Where a...
1,<p>My fiancée and I are looking for a good Car...,1,,1,8,<caribbean><cruising><vacations>,What are some Caribbean cruises for October?,"Toronto, Ontario. We can fly out of anywhere t..."
2,<p>My fiancée and I are looking for a good Car...,1,,1,8,<caribbean><cruising><vacations>,What are some Caribbean cruises for October?,"""Best"" for what? Please read [this page](http..."
3,<p>My fiancée and I are looking for a good Car...,1,,1,8,<caribbean><cruising><vacations>,What are some Caribbean cruises for October?,What do you want out of a cruise? To relax on ...
4,<p>Singapore Airlines has an all-business clas...,4,,1,8,<loyalty-programs><routes><ewr><singapore-airl...,Does Singapore Airlines offer any reward seats...,This route (as well as LAX-SIN) is being cance...


In [2]:
from bs4 import BeautifulSoup

In [206]:
# Remove HTML from "Body" column
posts_comments["Body_clean"] = posts_comments["Body"].map(lambda x: BeautifulSoup(str(x),"html.parser").get_text())

In [207]:
posts_comments.Body_clean[8]

u"I'm planning on taking the trans-Siberian / trans-Mongolian from Moscow to Beijing via Ulaan Bataar next year and I'd like some advice on how best to organise the visa situation. \nI'm a British citizen, partner is Swedish. We'll need visas for Russia, China and Mongolia. Seeing as you can only book the train tickets like 3 months in advance and you need to get all the visas together in that time as well, the process seems likely to be a bit complicated. Especially if you end up getting declined for a visa. \nWhat is the best process or method for obtaining the visas (in which country order) and is there any kind of trustworthy service that will do it for me? If I go through a service, what happens if one of my visas is declined?\n"

In [192]:
posts_comments.CommentText[8]

'If at all possible, save yourself the hassle of applying for a Russian Visa through the Russian embassy in Ulan Bator, Mongolia'

In [221]:
import sys
reload(sys)
sys.setdefaultencoding('utf8')

In [231]:
#COMBINE POST AND COMMENT TEXT INTO ONE COLUMN
posts_comments['AllText'] = posts_comments[['Body_clean', 'CommentText']].apply(lambda x: ''.join(x), axis=1)

In [232]:
posts_comments.AllText[8]

u"I'm planning on taking the trans-Siberian / trans-Mongolian from Moscow to Beijing via Ulaan Bataar next year and I'd like some advice on how best to organise the visa situation. \nI'm a British citizen, partner is Swedish. We'll need visas for Russia, China and Mongolia. Seeing as you can only book the train tickets like 3 months in advance and you need to get all the visas together in that time as well, the process seems likely to be a bit complicated. Especially if you end up getting declined for a visa. \nWhat is the best process or method for obtaining the visas (in which country order) and is there any kind of trustworthy service that will do it for me? If I go through a service, what happens if one of my visas is declined?\nIf at all possible, save yourself the hassle of applying for a Russian Visa through the Russian embassy in Ulan Bator, Mongolia"

In [229]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
stop_words = list(ENGLISH_STOP_WORDS)
print stop_words

In [233]:
vectorizer = CountVectorizer(stop_words=stop_words)
X = vectorizer.fit_transform(posts_comments['AllText'])

vectorizer.vocabulary_

{u'constan\u021ba': 23526,
 u'vang': 78426,
 u'localizer': 47254,
 u'gemmologist': 35583,
 u'woods': 81419,
 u'spiders': 69585,
 u'3027622187': 4623,
 u'hanging': 37830,
 u'woody': 81425,
 u'trawling': 75488,
 u'comically': 22672,
 u'localized': 47253,
 u'regularize': 62561,
 u'disobeying': 27716,
 u'canes': 19664,
 u'kawasakis': 44275,
 u'reclear': 62048,
 u'chanthaburi': 20823,
 u'refunding': 62441,
 u'caned': 19663,
 u'crossbar': 24766,
 u'beusselstr': 16692,
 u'touristed': 74844,
 u'user4550': 78049,
 u'wikiloc': 81011,
 u'touristen': 74845,
 u'bratislava': 18123,
 u'replaces': 63047,
 u'makemytrip': 48302,
 u'bringing': 18347,
 u'wooded': 81412,
 u'goresbrook': 36453,
 u'grueling': 37057,
 u'wooden': 81413,
 u'wednesday': 80509,
 u'circuitry': 21677,
 u'crossrate': 24788,
 u'salacgriva': 65239,
 u'330ml': 4928,
 u'bbqs': 16007,
 u'chiasso': 21192,
 u'thrace': 73863,
 u'maytham': 49216,
 u'targu': 72663,
 u'geocahe': 35698,
 u'snuggles': 68842,
 u'0050': 77,
 u'shorthaul': 67496,
 

In [238]:
len(vectorizer.vocabulary_.keys())

83943

In [243]:
documents = posts_comments.AllText

In [245]:
import re

In [246]:
# remove words that appear only once
frequency = defaultdict(int)

for text in documents:
    # tokenize with text.split() here too, so the counts match the tokens filtered below
    for token in text.split():
        frequency[token] += 1

texts = [[token for token in text.split() if frequency[token] > 1 and token not in stop_words]
          for text in documents]

# Create gensim dictionary object
dictionary = corpora.Dictionary(texts)

# Create corpus matrix
corpus = [dictionary.doc2bow(text) for text in texts]
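
`ENGLISH_STOP_WORDS` is entirely lowercase, while `text.split()` keeps the original case and any attached punctuation, so tokens such as "The", "If" and "I" pass the stop-word filter and show up in the topic output further down. A minimal sketch of a stricter tokenizer, written for Python 3; the `tokenize` helper, the regular expression and the two sample documents are assumptions for illustration, not part of the original notebook.

```python
import re
from collections import defaultdict
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def tokenize(text):
    """Lowercase, keep alphabetic runs of 2+ letters, drop stop words."""
    words = re.findall(r"[a-z]{2,}", text.lower())
    return [w for w in words if w not in ENGLISH_STOP_WORDS]

docs_demo = [
    "The visa office is closed. If I apply for a visa online, is that OK?",
    "The train ticket was cheap, and the train was fast.",
]

tokenized = [tokenize(d) for d in docs_demo]

# Count corpus frequency on the cleaned tokens, then drop words seen only once.
freq = defaultdict(int)
for tokens in tokenized:
    for t in tokens:
        freq[t] += 1
texts_demo = [[t for t in tokens if freq[t] > 1] for tokens in tokenized]

print(texts_demo)
```

The resulting token lists can be passed to `corpora.Dictionary` exactly like `texts` above. The simple pattern splits contractions such as "don't", which may or may not be acceptable for a given corpus.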

In [266]:
lda = models.ldamodel.LdaModel(
    corpus = corpus,
    id2word=dictionary,
    num_topics = 5,
    passes = 2)
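
`num_topics = 5` above is chosen by hand. One common way to compare candidate values is to fit a model per candidate and score each on held-out documents. The sketch below, written for Python 3, swaps in scikit-learn's `LatentDirichletAllocation` and its `perplexity` method instead of gensim so that it stays self-contained; the toy documents and candidate values are assumptions, and on a corpus this small the winner is not meaningful.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Made-up documents, only to make the loop runnable.
docs_demo = [
    "schengen visa application embassy", "uk visa refusal passport",
    "transit visa airport passport", "visa passport embassy appointment",
    "cheap flight ticket airline", "airline luggage allowance flight",
    "flight delay airport airline", "ticket refund airline flight",
    "train ticket station timetable", "night train station berth",
    "train pass station ticket", "bus train timetable station",
]

counts = CountVectorizer(stop_words='english').fit_transform(docs_demo)
train, held_out = train_test_split(counts, test_size=0.25, random_state=0)

perplexity_by_k = {}
for k in (2, 3, 4):
    model = LatentDirichletAllocation(n_components=k, random_state=0)
    model.fit(train)
    # Lower held-out perplexity suggests a better fit to unseen documents.
    perplexity_by_k[k] = model.perplexity(held_out)

best_k = min(perplexity_by_k, key=perplexity_by_k.get)
print(perplexity_by_k, best_k)
```

Perplexity does not always agree with how interpretable the topics look, so it is worth reading the top words for the best few candidates as well. gensim offers comparable scoring through `LdaModel.log_perplexity` and `CoherenceModel`.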

In [283]:
lda.print_topics()

[(0,
  u'0.011*"flight" + 0.010*"ticket" + 0.009*"The" + 0.008*"train" + 0.008*"-" + 0.006*"tickets" + 0.006*"If" + 0.006*"airport" + 0.005*"flights" + 0.005*"airline"'),
 (1,
  u'0.008*"The" + 0.007*"If" + 0.007*"use" + 0.006*"I" + 0.005*"it\'s" + 0.005*"In" + 0.004*"just" + 0.004*"You" + 0.004*"people" + 0.004*"card"'),
 (2,
  u'0.112*"I" + 0.011*"I\'m" + 0.007*"just" + 0.007*"like" + 0.007*"know" + 0.006*"want" + 0.006*"time" + 0.005*"question" + 0.005*"don\'t" + 0.005*"Is"'),
 (3,
  u'0.009*"The" + 0.006*"-" + 0.005*"people" + 0.005*"like" + 0.004*"it\'s" + 0.003*"There" + 0.003*"In" + 0.003*"places" + 0.003*"English" + 0.003*"good"'),
 (4,
  u'0.020*"visa" + 0.010*"passport" + 0.009*"I" + 0.009*"UK" + 0.008*"The" + 0.008*"Schengen" + 0.008*"US" + 0.007*"need" + 0.006*"country" + 0.005*"travel"')]

In [286]:
posts.Tags.value_counts().nlargest(20)

<visas><schengen>                 120
<visas>                            64
<visas><uk>                        64
<air-travel>                       61
<schengen>                         45
<visas><usa>                       29
<passports>                        20
<transit>                          20
<air-travel><luggage>              17
<customs-and-immigration>          17
<luggage>                          17
<air-travel><tickets>              16
<visas><usa><b1-b2-visas>          15
<usa>                              14
<visas><transit>                   14
<visas><china>                     14
<air-travel><health>               14
<usa><b1-b2-visas>                 13
<usa><customs-and-immigration>     13
<visas><uk><visa-refusal>          12
Name: Tags, dtype: int64
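
`value_counts()` on the raw `Tags` column counts tag combinations (for example `<visas><schengen>` as one value), not individual tags. For a comparison with LDA topics it may be more useful to split each string and count single tags. A minimal sketch in Python 3; the `split_tags` helper and the sample values are assumptions for illustration.

```python
import re
from collections import Counter

# A few tag strings in the same "<a><b>" format as posts.Tags; None mimics a missing value.
tags_demo = ['<visas><schengen>', '<visas>', '<visas><uk>', '<air-travel><luggage>', None]

def split_tags(tag_string):
    """Return the individual tags inside a '<a><b>' string, or [] for missing values."""
    if not isinstance(tag_string, str):
        return []
    return re.findall(r'<([^>]+)>', tag_string)

tag_counts = Counter(tag for s in tags_demo for tag in split_tags(s))
print(tag_counts.most_common(3))
```

On the real data the same counter could be built from `posts.Tags` and compared against the top words of each topic.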

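#### The Schengen/visa topic (topic 4) and the flight/ticket/train/airline topic (topic 0) make sense and correspond to some of the most frequent tags. Topics 1 to 3 are dominated by capitalized function words ("The", "I", "If") that slipped past the lowercase stop-word list, so they are hard to interpret.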

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 2. What makes an answer likely to be "accepted"?

---

**2.1 Build a model to predict whether a post will be marked as the answer.**

- This is a classification problem.
- You're free to use any of the machine learning algorithms or techniques we have learned in class to build the best model you can.
- NLP will be very useful here for pulling out useful and relevant features from the data. 
- Though not required, using bagging and boosting models like Random Forests and Gradient Boosted Trees will _probably_ get you the highest performance on the test data (but who knows!).


**2.2 Evaluate the performance of your classifier with a confusion matrix and accuracy. Explain how your model is performing.**

**2.3 Plot either a ROC curve or precision-recall curve (or both!) and explain what they tell you about your model.**

NOTE: You should only be predicting this for `PostTypeId == 2` posts, which are the "answer" posts. This doesn't mean, however, that you can't or shouldn't use the parent questions as predictors!
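
For 2.3, both curves are computed from predicted scores (for example `predict_proba(X)[:, 1]`), not from hard class labels. A minimal sketch in Python 3 with four made-up labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc, precision_recall_curve

# Made-up labels and predicted probabilities for the positive class.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# ROC: false-positive rate vs true-positive rate across thresholds.
fpr, tpr, roc_thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

# Precision-recall: often more informative when the positive class is rare.
precision, recall, pr_thresholds = precision_recall_curve(y_true, y_score)

print('ROC AUC: %.2f' % roc_auc)
```

Plotting `fpr` against `tpr` (or `recall` against `precision`) gives the curves. An AUC near 0.5 means the scores rank accepted and non-accepted answers no better than chance.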


In [3]:
posts2 = pd.read_csv('/Users/sashakapralov/Desktop/DSI-SF-5-Working/datasets/stack_exchange_travel/posts_train.csv')

In [4]:
posts2.columns

Index([u'AcceptedAnswerId', u'AnswerCount', u'Body', u'ClosedDate',
       u'CommentCount', u'CommunityOwnedDate', u'CreationDate',
       u'FavoriteCount', u'Id', u'LastActivityDate', u'LastEditDate',
       u'LastEditorDisplayName', u'LastEditorUserId', u'OwnerDisplayName',
       u'OwnerUserId', u'ParentId', u'PostTypeId', u'Score', u'Tags', u'Title',
       u'ViewCount'],
      dtype='object')

In [5]:
#identify question posts with identifiers of accepted answers
qs_w_acc_ans = posts2[posts2.AcceptedAnswerId.notnull()]
#take accepted answer ID and question text columns
qs_w_acc_ans = qs_w_acc_ans[['AcceptedAnswerId','Body']]
#rename columns
qs_w_acc_ans.rename(columns={'Body': 'QBody', 'AcceptedAnswerId': 'Id'}, inplace=True)
#add column that will identify accepted answers
qs_w_acc_ans['Accepted'] = 1

In [6]:
#extract all answer posts
all_answers = posts2[posts2.PostTypeId == 2]

In [7]:
#merge accepted answer IDs and corresponding questions with full set of answers
all_answers = pd.merge(all_answers, qs_w_acc_ans, how='left', on='Id')

In [8]:
#mark non-accepted answers as such
all_answers.Accepted.fillna(0, inplace=True)

In [9]:
all_answers.Accepted.value_counts()

0.0    17451
1.0     6516
Name: Accepted, dtype: int64
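
Note that the merge above joins questions to answers on `AcceptedAnswerId`, so `QBody` is filled only for accepted answers. Any feature built from `QBody` then carries the label with it. A hedged sketch of one alternative in Python 3, on a made-up five-row frame: derive the label with `isin` and attach the question text to every answer through `ParentId`.

```python
import numpy as np
import pandas as pd

# Toy posts table: two questions (PostTypeId 1) and three answers (PostTypeId 2).
posts_demo = pd.DataFrame({
    'Id': [1, 2, 3, 4, 5],
    'PostTypeId': [1, 2, 2, 1, 2],
    'ParentId': [np.nan, 1, 1, np.nan, 4],
    'AcceptedAnswerId': [3, np.nan, np.nan, np.nan, np.nan],
    'Body': ['Q one', 'A one-a', 'A one-b', 'Q two', 'A two-a'],
})

questions = posts_demo[posts_demo.PostTypeId == 1]
answers = posts_demo[posts_demo.PostTypeId == 2].copy()

# Label: an answer is accepted if its Id appears as some question's AcceptedAnswerId.
accepted_ids = questions['AcceptedAnswerId'].dropna().astype(int)
answers['Accepted'] = answers['Id'].isin(accepted_ids).astype(int)

# Question text for every answer, accepted or not, via ParentId.
answers['ParentId'] = answers['ParentId'].astype(int)
question_text = questions[['Id', 'Body']].rename(
    columns={'Id': 'ParentId', 'Body': 'QBody'})
answers = answers.merge(question_text, on='ParentId', how='left')

print(answers[['Id', 'Accepted', 'QBody']])
```

With this join, the question text is available for every answer, so it no longer reveals which answer was accepted.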

In [10]:
all_answers.dtypes

AcceptedAnswerId         float64
AnswerCount              float64
Body                      object
ClosedDate                object
CommentCount               int64
CommunityOwnedDate        object
CreationDate              object
FavoriteCount            float64
Id                         int64
LastActivityDate          object
LastEditDate              object
LastEditorDisplayName     object
LastEditorUserId         float64
OwnerDisplayName          object
OwnerUserId              float64
ParentId                 float64
PostTypeId                 int64
Score                      int64
Tags                      object
Title                     object
ViewCount                float64
QBody                     object
Accepted                 float64
dtype: object

In [11]:
all_answers.Accepted = all_answers.Accepted.astype(int)

In [12]:
all_answers.shape

(23967, 23)

In [13]:
# Remove HTML from "QBody" column
all_answers["QBody"] = all_answers["QBody"].map(lambda x: BeautifulSoup(str(x),"html.parser").get_text())

In [14]:
# Remove HTML from "Body" column
all_answers["Body"] = all_answers["Body"].map(lambda x: BeautifulSoup(str(x),"html.parser").get_text())

In [15]:
#COMBINE QUESTION AND ANSWER TEXT INTO ONE COLUMN
all_answers['AllText'] = all_answers[['Body', 'QBody']].apply(lambda x: ''.join(x), axis=1)

In [16]:
all_answers.AllText = all_answers.AllText.apply(lambda x: x.replace('\n',' '))

In [17]:
all_answers.AllText.size

23967

In [18]:
all_answers.AllText[22]

u"While Europe does have per country operators, there are some operators that will provide very low roaming rates. I've not tried any of them though, and not all of them cover the whole of Europe yet. Be aware that some countries (such as the UK) have a very competitive Pay As You Go market with very cheap flexible sims available with bundled data etc, while other countries, such as France, are a lot less competitive.  Possible Duplicate: What are the best ways to avoid data roaming fees when travelling abroad?   Is where any default cheap mobile provider across Europe? I want to buy sim-card in one country (East or Nord Europe), and use it in other (West Europe). Is it possible? "

In [19]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [20]:
X_text_col = all_answers['AllText']

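# Preprocess text data to Tfidf
# NOTE: an integer max_df is an absolute document count, so max_df=100 drops every
# term that appears in more than 100 answers; pass a float in [0.0, 1.0] for a proportion.
vect = TfidfVectorizer(stop_words='english',ngram_range=(1,2),min_df=10,max_df=100)
X_text = vect.fit_transform(X_text_col)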

In [21]:
X_text.shape

(23967, 22611)

In [22]:
X_txt = X_text.toarray()

In [23]:
X_txt.shape

(23967, 22611)

In [24]:
all_answers.CommentCount.value_counts().nlargest(10)

0    11341
1     4190
2     2955
3     1807
4     1134
5      764
6      553
7      367
8      238
9      150
Name: CommentCount, dtype: int64

In [25]:
X_nontext = all_answers['CommentCount'].values
X_nontext.shape

(23967,)

In [26]:
X_nontxt = X_nontext.reshape(23967,1)

In [27]:
import numpy as np

In [28]:
X = np.concatenate((X_txt, X_nontxt), axis=1)
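
Converting the TF-IDF matrix with `toarray()` materialises 23,967 x 22,611 floats, roughly 4 GB at 8 bytes each. A minimal sketch of a sparse alternative in Python 3, on made-up data: `scipy.sparse.hstack` appends the numeric column without densifying. The `*_demo` names are assumptions for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

texts_demo = ['visa for schengen', 'cheap flight ticket', 'train ticket refund']
comment_counts = np.array([0, 3, 1])

X_text_demo = TfidfVectorizer().fit_transform(texts_demo)          # sparse matrix
X_num_demo = sparse.csr_matrix(comment_counts.reshape(-1, 1))      # one sparse column

# Stack side by side; the result stays sparse and works with LogisticRegression.
X_demo = sparse.hstack([X_text_demo, X_num_demo], format='csr')
print(X_demo.shape)
```

`reshape(-1, 1)` also avoids hard-coding the row count, which matters once the same code has to run on the test set.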

In [29]:
Y = all_answers.Accepted.values
Y.shape

(23967,)

In [30]:
Y_vls = Y.reshape(23967,1)

In [31]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.metrics import confusion_matrix, roc_curve, auc, precision_recall_curve, average_precision_score
from sklearn.preprocessing import scale, MinMaxScaler, normalize, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import KFold, GridSearchCV, cross_val_score
# Old-style iterator API used in the next cell. sklearn.cross_validation was removed in
# scikit-learn 0.20; model_selection.StratifiedKFold(n_splits=...).split(X, y) replaces it.
from sklearn.cross_validation import StratifiedKFold



In [56]:
cv_indices = StratifiedKFold(Y, n_folds = 5, random_state=1234)
logreg = LogisticRegression()
lr_scores = []

for train_inds, test_inds in cv_indices:
    
    Xtr, ytr = X[train_inds, :], Y[train_inds]
    Xte, yte = X[test_inds, :], Y[test_inds]
    
    logreg.fit(Xtr, ytr)
    lr_scores.append(logreg.score(Xte, yte))
    
print 'Logistic Regression:'
print lr_scores
print np.mean(lr_scores)
# Baseline = accuracy of always predicting the majority class (here: not accepted)
print 'Baseline accuracy:', max(np.mean(Y), 1 - np.mean(Y))

Logistic Regression:
[0.74556830031282584, 0.74024619236386402, 0.73857709159190488, 0.74650532025871064, 0.74441894429376176]
0.743063169764
Baseline accuracy: 0.728126173489


In [None]:
ss = StandardScaler()
Xn = ss.fit_transform(X)
lr_params = {
    'penalty':['l1','l2'],
    'solver':['liblinear'],
    'C':np.linspace(0.0001, 1000, 10)
}
lr_gs = GridSearchCV(LogisticRegression(), lr_params, cv=5, verbose=1)
lr_gs.fit(Xn, Y)
print lr_gs.best_params_
best_lr = lr_gs.best_estimator_

Fitting 5 folds for each of 20 candidates, totalling 100 fits
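
For 2.2, accuracy alone can hide how the errors are distributed, especially when about 73% of answers are not accepted. A minimal sketch in Python 3 of a confusion matrix next to the majority-class baseline, using eight made-up labels and predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# Made-up true labels and predictions (1 = accepted).
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 1, 0, 0, 1, 0])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = accuracy_score(y_true, y_pred)
# Accuracy of always predicting the more frequent class.
baseline = max(np.mean(y_true), 1 - np.mean(y_true))

print(cm)
print('accuracy: %.3f, baseline: %.3f' % (accuracy, baseline))
```

A model is only useful to the extent that its accuracy clears the baseline, and the `fn` and `fp` counts show which kind of mistake it makes more often.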


<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 3. What is the score of a post?

---

**3.1 Build a model that predicts the score of a post.**

- This is a regression problem now. 
- You can and should be predicting score for both "question" and "answer" posts, so keep them both in your dataset.
- Again, use any techniques that you think will get you the best model.

**3.2 Evaluate the performance of your model with cross-validation and report the results.**

**3.3 What is important for determining the score of a post, if anything?**
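
As a rough illustration of 3.2 and 3.3, the sketch below (Python 3) cross-validates a ridge regression on synthetic data and then inspects the fitted coefficients. The features, the target and the choice of `Ridge` are assumptions made only to keep the example runnable; they say nothing about what actually drives post scores.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data: only the first two of five features influence the target.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
r2_scores = cross_val_score(Ridge(alpha=1.0), X_demo, y_demo, cv=cv, scoring='r2')
# scikit-learn reports errors as negative scores, so flip the sign.
mae_scores = -cross_val_score(Ridge(alpha=1.0), X_demo, y_demo, cv=cv,
                              scoring='neg_mean_absolute_error')

# Coefficient magnitudes hint at which (comparably scaled) features matter.
model = Ridge(alpha=1.0).fit(X_demo, y_demo)

print('mean R^2: %.3f, mean MAE: %.3f' % (r2_scores.mean(), mae_scores.mean()))
print(model.coef_)
```

Coefficients are only comparable when features are on similar scales, so standardising numeric features first is usually worthwhile.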


<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 4. How many views does a post have?

---

**4.1 Build a model that predicts the number of views a post has.**

- This is another regression problem. 
- Predict the views for all posts, not just the "answer" posts.

**4.2 Evaluate the performance of your model with cross-validation and report the results.**

**4.3 What is important for the number of views a post has, if anything?**
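
Count targets such as views are often strongly right-skewed; that is an assumption here, worth checking with a histogram of `ViewCount`. If it holds, one option is to fit on `log1p(views)` and invert predictions with `expm1`. A minimal sketch in Python 3 on synthetic data, using `TransformedTargetRegressor` around a random forest (the model choice and the `*_demo` data are illustrative only):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic, right-skewed "view counts" driven mainly by the first feature.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(300, 4))
views_demo = np.round(np.exp(5 + 1.5 * X_demo[:, 0] + rng.normal(scale=0.3, size=300)))

# The wrapper applies log1p to the target before fitting and expm1 to predictions.
model = TransformedTargetRegressor(
    regressor=RandomForestRegressor(n_estimators=50, random_state=0),
    func=np.log1p,
    inverse_func=np.expm1,
)

mae_scores = -cross_val_score(model, X_demo, views_demo, cv=5,
                              scoring='neg_mean_absolute_error')

model.fit(X_demo, views_demo)
importances = model.regressor_.feature_importances_

print('mean MAE: %.1f' % mae_scores.mean())
print(importances)
```

The fitted forest's `feature_importances_` gives a first answer to 4.3, with the usual caveat that impurity-based importances can favour high-cardinality features.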

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 5. Build a pipeline or other code to automate evaluation of your models on the test data.

---

Now that you've constructed your three predictive models, build a pipeline or code that can easily load up the raw testing data and evaluate your models on it.

The testing data that is held out is in the same raw format as the training data you have. _Any cleaning and preprocessing that you did on the training data will need to be done on the testing data as well!_

This is a good opportunity to practice building pipelines, but you're not required to. Custom functions and classes are fine as long as they are able to process and test the new data.
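
A minimal sketch of the pipeline idea in Python 3, on made-up text: the vectorizer and the classifier are fitted together on the training data, and the fitted pipeline then applies the same preprocessing to raw test text. The toy documents, labels and step names are assumptions for illustration.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up training text and labels (1 = accepted).
train_text = [
    'you need a schengen visa for this trip',
    'no idea sorry',
    'the embassy issues the visa within ten days',
    'maybe ask someone else',
]
train_y = [1, 0, 1, 0]
test_text = ['apply for the visa at the embassy', 'sorry no idea']

clf = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('logreg', LogisticRegression()),
])

# fit() learns the vocabulary and the model from training data only.
clf.fit(train_text, train_y)

# predict() reuses the fitted vocabulary on unseen text.
test_pred = clf.predict(test_text)
print(test_pred)
```

Because the vocabulary and any scaling are learned inside `fit`, the same object can also be passed to `GridSearchCV` or `cross_val_score` without leaking information from validation folds. Custom cleaning steps such as HTML stripping can be added as extra steps, for instance with `FunctionTransformer`.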


<img src="http://imgur.com/xDpSobf.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 6. Let's Model - Tournament for stock market predictions

>Start this section of the project by downloading the train and test datasets from the following site: https://numer.ai/rules

> - The data set is clean; your goal is to develop one or more classification models.
> - Report all results, including log loss and any other metrics you consider interesting.
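
Log loss scores predicted probabilities rather than hard labels and penalises confident wrong predictions heavily. A minimal sketch in Python 3 that computes it by hand and checks the result against `sklearn.metrics.log_loss`; the four labels and probabilities are made up.

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up labels and predicted probabilities of the positive class.
y_true = np.array([1, 0, 1, 1])
p_pred = np.array([0.9, 0.2, 0.6, 0.3])

# Binary log loss: mean negative log-likelihood of the true labels.
manual = -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
library = log_loss(y_true, p_pred)

print('manual: %.4f, sklearn: %.4f' % (manual, library))
```

Predicting 0.5 for every row gives a log loss of ln 2 (about 0.693), which is a convenient reference point for a balanced binary problem.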