## Capstone project Pricify

### Business Understanding

The goal is to build a model that predicts the price of an item from its picture. OfferUp says that "With a single snap, you can take a photo of an item and instantly circulate it to people nearby.", so it would be interesting to suggest a price as well: you take a photo of an item and decide whether to sell it or not.

### Data Understanding

OfferUp scraping

scrap_offerup.rb - sends HTTP requests to https://offerupnow.com/ and scrapes the recent offers page by page until the date limit is reached. Each offer is stored in a file in JSON format.

normalize_scraped.rb - splits and combines the scraped offers into three JSON files under offerup-data:
* items.json 605 MB 380,107 items
* owners.json 141.2 MB ? item owners
* images.json 214.1 MB ? links to item images
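
The normalization itself is done by the Ruby script, whose internals are not shown here. As a rough Python sketch of the same idea, assuming each scraped offer is a dict with a nested `owner` record and a `photos` list (both field names are hypothetical), the split into three tables could look like this:

```python
def normalize_offers(offers):
    """Split raw offers into items, owners and images, keeping one record per owner id."""
    items, owners, images = [], {}, []
    for offer in offers:
        owner = offer.get("owner", {})           # hypothetical nested owner record
        if owner:
            owners[owner["id"]] = owner          # later duplicates overwrite earlier ones
        for url in offer.get("photos", []):      # hypothetical list of image links
            images.append({"item_id": offer["id"], "url": url})
        # the item keeps everything except the nested parts, plus a link to its owner
        item = {k: v for k, v in offer.items() if k not in ("owner", "photos")}
        item["owner_id"] = owner.get("id")
        items.append(item)
    return items, list(owners.values()), images


raw_offers = [
    {"id": 1, "title": "Bike", "price": 50, "owner": {"id": 10, "name": "A"}, "photos": ["u1", "u2"]},
    {"id": 2, "title": "Lamp", "price": 5, "owner": {"id": 10, "name": "A"}, "photos": ["u3"]},
]
items, owners, images = normalize_offers(raw_offers)
print(len(items), len(owners), len(images))
```

Keeping owners and images in separate tables avoids repeating the same owner record for every item that person lists.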

download_images.rb - downloads three sizes of each image, organized into subfolders named by offer id:
* detail
* full
* list


./scrap_offerup.rb -c /tmp/cookie.txt -s ./scraped -t 0 -n 20000 -p 0.5 | tee scrap.log

In [None]:
import pandas as pd
import matplotlib
%matplotlib inline

In [None]:
df = pd.read_json('../data/items.json')

In [None]:
cdf = df.copy()

In [None]:
cdf.shape

In [None]:
cdf.head()

In [None]:
sample = cdf.sample(100)

In [None]:
sample.shape

In [None]:
cdf.info()

#### Remove duplicates

Removing duplicate ids reduces the dataset from 380,107 to 300,304 rows.

In [None]:
cdf.drop_duplicates(subset='id', inplace=True)

In [None]:
cdf.shape
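
By default `drop_duplicates` keeps the first row it sees for each `id`. The plain-Python sketch below shows the same rule on a few made-up records (the rows and the helper name `drop_duplicate_ids` are illustrative only):

```python
def drop_duplicate_ids(records):
    """Keep the first record for every id, preserving the original order."""
    seen = set()
    unique = []
    for record in records:
        if record["id"] not in seen:
            seen.add(record["id"])
            unique.append(record)
    return unique


rows = [{"id": 1, "price": 10}, {"id": 2, "price": 20}, {"id": 1, "price": 99}]
unique_rows = drop_duplicate_ids(rows)
print(len(rows), len(unique_rows))
```

If the scraper saw the same offer twice with different prices, this rule silently keeps the earlier one, which is worth remembering when prices are the target.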

### Data Preparation

Start from items.json. After removing duplicates we have 300,304 rows and 34 columns.

Columns are: 

* category - object (calculate how many categories, split into separate table?)
* condition	 - int (40, 100 ? looks like categorical)
* description - text
* distance	- distance from logged user, not applicable
* get_full_url	- link to offer
* get_img_medium_height	400
* get_img_medium_width	300
* get_img_permalink_large
* get_img_permalink_medium
* get_img_permalink_small
* get_img_small_height
* get_img_small_width
* get_small_square_thumbanil
* id	65194613
* image	None
* image_mob_det_hd
* image_mob_list_hd
* latitude	47.8426
* listing_type	2
* location_name	Lynnwood, WA
* longitude	-122.295
* owner_id	6787474
* payable	False
* post_date	2015-12-19T19:38:50.398Z
* post_from_store_address	Lynnwood, WA
* price	25
* priority	100
* reservable	False
* reserved	False
* review_status	2
* sold_date	None
* sold_offer_id	NaN
* state	3
* title

1. Category
 - category 380107 non-null object - no missing values
 - Create two features with category data - category_id and category_name

In [None]:
cdf['category_id'] =  cdf.category.apply(lambda x: int(x['id']))

In [None]:
cdf['category_name'] =  cdf.category.apply(lambda x: str(x['name']))

In [None]:
#Remove outliers
cdf = cdf[cdf['price'] < 1500]

In [None]:
cdf.shape

In [None]:
cat_counts = cdf['category_name'].value_counts()

In [None]:
len(cat_counts)

In [None]:
cat_counts

So, we have 37 categories of items

In [None]:
cat_counts.plot(kind='density')

In [None]:
#Selected features: 
features = ['id', 'description', 'title', 'category_id', 'category_name', 'price']
fdf = cdf[features]

Split data into categories:

In [None]:
phones_category = fdf[fdf['category_name'] == 'Cell Phones']
phones_category.shape

In [None]:
phones_category.head()

In [None]:
apparel_category = fdf[fdf['category_name'].isin(['Baby & Kids', 'Clothing & Shoes'])]
apparel_category.shape
#apparel_category.id.to_csv('apparel_category.csv', index=False)


In [None]:
apparel_category.head()

In [None]:
house_category = fdf[fdf['category_name'].isin(['Furniture', 'Household', 'Home & Garden'])]
house_category.shape
#house_category.id.to_csv('house_category.csv', index=False)

In [None]:
house_category.head()

#### Let's see the most common words in titles

In [None]:
phones_base = cdf[cdf['category_name'] == 'Cell Phones']
from collections import Counter
l = map(lambda x: x.split(), phones_base.title.tolist())
c = Counter([item.lower() for sublist in l for item in sublist])
c.most_common(10)

#### Most common words in descriptions

In [None]:
from collections import Counter
l = map(lambda x: x.split(), phones_base.description.tolist())
c = Counter([item.lower() for sublist in l for item in sublist])
c.most_common(10)

In [None]:
from collections import Counter
l = map(lambda x: x.split(), apparel_category.title.tolist())
c = Counter([item.lower() for sublist in l for item in sublist])
c.most_common(10)

In [None]:
from collections import Counter
l = map(lambda x: x.split(), apparel_category.description.tolist())
c = Counter([item.lower() for sublist in l for item in sublist])
c.most_common(10)

In [None]:
from collections import Counter
l = map(lambda x: x.split(), house_category.title.tolist())
c = Counter([item.lower() for sublist in l for item in sublist])
c.most_common(10)

In [None]:
from collections import Counter
l = map(lambda x: x.split(), house_category.description.tolist())
c = Counter([item.lower() for sublist in l for item in sublist])
c.most_common(10)
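
The six cells above repeat the same counting logic. A small helper, sketched here with a made-up name `most_common_words` and toy titles, would do the same job in one place:

```python
from collections import Counter


def most_common_words(texts, n=10):
    """Count lower-cased, whitespace-separated words across all texts."""
    counts = Counter(word.lower() for text in texts for word in text.split())
    return counts.most_common(n)


titles = ["iPhone 6 unlocked", "Samsung Galaxy S5", "iphone 5s"]
print(most_common_words(titles, 1))  # [('iphone', 2)]
```

With the notebook's data it could be called as `most_common_words(phones_base.title.tolist())`, and likewise for the other categories and for descriptions.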

In [None]:
def key_words_check(title, words):
    """Return True if the lower-cased title shares at least one word with `words`."""
    return len(set(title.lower().split()).intersection(words)) > 0

In [None]:
phone_words = set(['unlocked', 'iphone', 'galaxy', 'samsung', 'note', 'phone', 'lg', 'htc', 'verizon', 't-mobile', 'at&t', 
              'tmobile','nokia', 'mobile', 'sony', 'motorola', 'unlocled', 'lumia', 'smart', 'phones', 'nexus'])
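
A quick illustration of the keyword filter on a few invented titles, using a shortened word list and lower-casing the title before matching (the keyword set is all lower case):

```python
def key_words_check(title, words):
    """True if any lower-cased word of the title is in the keyword set."""
    return not set(title.lower().split()).isdisjoint(words)


phone_words = {'iphone', 'galaxy', 'samsung', 'unlocked'}  # shortened list, for illustration only
titles = ['iPhone 6 Unlocked', 'Oak dining table', 'Galaxy S5 case']
print([key_words_check(t, phone_words) for t in titles])  # [True, False, True]
```

The last title shows the limit of this approach: a phone case matches the keyword list even though it is not a phone.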


In [None]:
#phones_category = cdf[cdf['category_name'].isin(['Cell Phones', 'Electronics'])]
#phones_category = phones_category[phones_category['title'].apply(lambda x: key_words_check(x, phone_words))]
#phones_category.shape
#phones_category.id.to_csv('phones_category.csv', index=False)

In [None]:
phones_category = fdf[fdf['category_name'] == 'Cell Phones']
phones_category.id.to_csv('phones_category.csv', index=False)
phones_category.shape

### Features

#### Condition
  
* condition 380107 non-null int64 - no missing values
* value range: [0, 20, 40, 60, 80, 100], so it looks like categorical.
* Create 6 features with conditions

In [None]:
condition_counts = cdf['condition'].value_counts()

In [None]:
condition_counts

In [None]:
condition_counts.plot(kind='density')

In [None]:
for i in  [0, 20, 40, 60, 80, 100]:
    cdf['condition_' + str(i)] =  cdf.condition.apply(lambda x: 1 if x == i else 0 )

#### Description, title
  
* description                   380107 non-null object - no missing values
* title                         380107 non-null object
* vectorize with TF-IDF
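
To make the TF-IDF step concrete, here is a minimal pure-Python sketch. It assumes whitespace tokenization and the smoothed weighting idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization; it is meant as an illustration, not as a replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1.0 + n_docs) / (1.0 + d)) + 1.0 for t, d in df.items()}
    vectors = []
    for tokens in tokenized:
        weights = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


docs = ["iphone 6 unlocked", "iphone 5 cracked screen", "oak table"]
vectors = tf_idf(docs)
print(sorted(vectors[0].items(), key=lambda kv: kv[1]))
```

In the first toy title, "iphone" gets a lower weight than "unlocked" because it also appears in the second title, which is the effect we want for words that are common across a whole category.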

#### Distance
  
* distance                      380107 non-null int64 - no missing values
* The distance appears to be calculated from the location stored in the cookie file, that is, from my home, so it is not relevant in general.

#### Owner
* owner_id                      380107 non-null int64
* Information about the owner looks valuable, but for consignment-style sales it is unlikely that the same person sells the same type of item again and again, so I will skip the owner information.

#### Priority
* priority                      380107 non-null int64
* All observations have the same priority (100), so I will not use it.

#### Reservable, reserved
* reservable                    380107 non-null bool
* reserved                      380107 non-null bool
* Only two offers have reservable=True, so I will not use it.
* All offers have reserved=False, so I will not use it.

In [None]:
cdf['payable'].value_counts()

#### Price
  
* price                         380107 non-null float64 - no missing values
* target

### Save for later

* latitude                      380107 non-null float64
* listing_type                  380107 non-null int64
* location_name                 380107 non-null object
* longitude                     380107 non-null float64
* payable                       380107 non-null bool
* post_date                     380107 non-null object
* post_from_store_address       377435 non-null object
* review_status                 285697 non-null float64
* sold_date                     6486 non-null object
* sold_offer_id                 3862 non-null float64
* state                         380107 non-null int64


### Other

Doesn't look relevant

* get_full_url                  380107 non-null object
* get_img_medium_height         380107 non-null int64
* get_img_medium_width          380107 non-null int64
* get_img_permalink_large       380107 non-null object
* get_img_permalink_medium      380107 non-null object
* get_img_permalink_small       380107 non-null object
* get_img_small_height          380107 non-null int64
* get_img_small_width           380107 non-null int64
* get_small_square_thumbanil    380107 non-null object
* id                            380107 non-null int64
* image                         159857 non-null object
* image_mob_det_hd              380107 non-null object
* image_mob_list_hd             380107 non-null object

### As a result, I will use the following columns:

['id', 'description', 'title', 'category_id', 'category_name', 'price']

# First Model - category classifier


I will build a recommendation model for each category. Let's return to the whole dataset and try to predict the item class from 'title', 'description' and the 'deep_features' extracted from the product picture. For this we need to train the model on the whole dataset. There are two options for the target:
1. Stay with the current categories: predict one of ['Cell Phones', 'Baby & Kids', 'Clothing & Shoes', 'Games & Toys', 'Furniture', 'Household', 'Home & Garden'] and then select the next model based on which of these 7 categories was predicted.
2. Create a new target with values ['phones', 'apparel', 'home'].
Let's compare these two models.


First, split data into test and train subsets. 


#### Create new target

In [None]:
def set_category_name(category):
    if category == 'Cell Phones':
        return 'phones'
    elif category in ['Furniture', 'Household', 'Home & Garden']:
        return 'home'
    elif category in ['Baby & Kids', 'Clothing & Shoes']:
        return 'apparel'
    else:
        return 'other'

In [None]:
fdf = fdf.copy()  # work on a copy so the new column is not written to a slice of cdf
fdf['category'] = fdf.category_name.apply(set_category_name)

In [None]:
all_categories = fdf[fdf.category != 'other']
categories = [
        'apparel',
        'home',
        'phones'
    ]  # alphabetical, matching the sorted label order scikit-learn uses for coef_ and reports

In [None]:
all_target_small = all_categories['category_name']
all_target_big = all_categories['category']
all_text = all_categories.title + " " + all_categories.description

In [None]:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in releases before 0.18

all_big_train, all_big_test, target_big_train, target_big_test = train_test_split(all_text, all_target_big, test_size=0.2, random_state=55)
all_small_train, all_small_test, target_small_train, target_small_test = train_test_split(all_text, all_target_small, test_size=0.2, random_state=55)

### Features

For both models features will be the same. 

#### Bag of Words or “Bag of n-grams” representation
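
As a small illustration of what `ngram_range=(1, 3)` means, the sketch below lists all word 1-grams, 2-grams and 3-grams of a short phrase. It uses plain whitespace tokenization and no stop-word removal, so it is a simplification of what the vectorizer in the next cells does.

```python
def word_ngrams(text, ngram_range=(1, 3)):
    """Return all word n-grams of the text for n in the given inclusive range."""
    tokens = text.lower().split()
    smallest, largest = ngram_range
    grams = []
    for n in range(smallest, largest + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


print(word_ngrams("very good condition"))
# ['very', 'good', 'condition', 'very good', 'good condition', 'very good condition']
```

Phrases such as "good condition" carry pricing information that the single words lose, which is the reason to go beyond a plain bag of words.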

In [None]:
from __future__ import print_function  # must be the first statement in the cell

from time import time

import matplotlib.pyplot as plt
import numpy as np
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import (PassiveAggressiveClassifier, Perceptron,
                                  RidgeClassifier, SGDClassifier)
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.utils.extmath import density

In [None]:
text_train, text_test, y_train, y_test = all_big_train, all_big_test, target_big_train, target_big_test

In [None]:
text_test.head()

In [None]:
# Vectorize the text: fit TF-IDF on the training split, then reuse it on the test split


print("Extracting features from the training data using a sparse vectorizer")
t0 = time()
#vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
#                                 stop_words='english')
vectorizer = TfidfVectorizer(input='content', lowercase=True, tokenizer=None,
                                    stop_words='english', use_idf=True,
                                    max_features=1000, ngram_range=(1, 3))
X_train = vectorizer.fit_transform(text_train)
duration = time() - t0

print("n_samples: %d, n_features: %d" % X_train.shape)
print()

print("Extracting features from the test data using the same vectorizer")
t0 = time()
X_test = vectorizer.transform(text_test)
duration = time() - t0

print("n_samples: %d, n_features: %d" % X_test.shape)
print()

# mapping from integer feature name to original token string
feature_names = vectorizer.get_feature_names()


t0 = time()
# SelectKBest defaults to k=10, so only the 10 highest-scoring features are kept
ch2 = SelectKBest(chi2)
X_train = ch2.fit_transform(X_train, y_train)
X_test = ch2.transform(X_test)
if feature_names:
    # keep selected feature names
    feature_names = [feature_names[i] for i
                         in ch2.get_support(indices=True)]
print("done in %fs" % (time() - t0))
print()

if feature_names:
    feature_names = np.asarray(feature_names)


def trim(s):
    """Trim string to fit on terminal (assuming 80-column display)"""
    return s if len(s) <= 80 else s[:77] + "..."

In [None]:
# Benchmark classifiers
def benchmark(clf):
    print('_' * 80)
    print("Training: ")
    print(clf)
    t0 = time()
    clf.fit(X_train, y_train)
    train_time = time() - t0
    print("train time: %0.3fs" % train_time)

    t0 = time()
    pred = clf.predict(X_test)
    test_time = time() - t0
    print("test time:  %0.3fs" % test_time)

    score = metrics.accuracy_score(y_test, pred)
    print("accuracy:   %0.3f" % score)

    if hasattr(clf, 'coef_'):
        print("dimensionality: %d" % clf.coef_.shape[1])
        print("density: %f" % density(clf.coef_))

        
        print("top 10 keywords per class:")
        for i, category in enumerate(clf.classes_):  # rows of coef_ follow clf.classes_
            top10 = np.argsort(clf.coef_[i])[-10:]
            print(trim("%s: %s"
                    % (category, " ".join(feature_names[top10]))))
        print()

    
    print("classification report:")
    print(metrics.classification_report(y_test, pred,
                                            target_names=categories))

    
    print("confusion matrix:")
    print(metrics.confusion_matrix(y_test, pred))

    print()
    clf_descr = str(clf).split('(')[0]
    return clf_descr, score, train_time, test_time
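
To read the benchmark output it helps to see how accuracy and the confusion matrix are built. The sketch below recomputes both in plain Python on invented labels. Rows are true classes, columns are predicted classes, and labels are sorted alphabetically, which is also the order scikit-learn uses when no explicit label list is passed.

```python
def confusion_matrix(y_true, y_pred):
    """Return (sorted labels, matrix) with rows = true class, columns = predicted class."""
    labels = sorted(set(y_true) | set(y_pred))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, guess in zip(y_true, y_pred):
        matrix[index[truth]][index[guess]] += 1
    return labels, matrix


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / float(len(y_true))


y_true = ['home', 'phones', 'apparel', 'home', 'phones']
y_pred = ['home', 'phones', 'home', 'home', 'apparel']
labels, matrix = confusion_matrix(y_true, y_pred)
print(labels)                    # ['apparel', 'home', 'phones']
print(matrix)                    # [[0, 1, 0], [0, 2, 0], [1, 0, 1]]
print(accuracy(y_true, y_pred))  # 0.6
```

Because the rows and columns follow the sorted labels, any list of display names passed to the report has to be in that same alphabetical order.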


In [None]:
results = []
for clf, name in (
        (RidgeClassifier(tol=1e-2, solver="lsqr"), "Ridge Classifier"),
        (Perceptron(n_iter=50), "Perceptron"),
        (PassiveAggressiveClassifier(n_iter=50), "Passive-Aggressive"),
        #(KNeighborsClassifier(n_neighbors=10), "kNN"),
        (RandomForestClassifier(n_estimators=100), "Random forest")):
    print('=' * 80)
    print(name)
    results.append(benchmark(clf))

for penalty in ["l2", "l1"]:
    print('=' * 80)
    print("%s penalty" % penalty.upper())
    # Train Liblinear model
    results.append(benchmark(LinearSVC(loss='squared_hinge', penalty=penalty,
                                            dual=False, tol=1e-3)))

    # Train SGD model
    results.append(benchmark(SGDClassifier(alpha=.0001, n_iter=50,
                                           penalty=penalty)))

# Train SGD with Elastic Net penalty
print('=' * 80)
print("Elastic-Net penalty")
results.append(benchmark(SGDClassifier(alpha=.0001, n_iter=50,
                                       penalty="elasticnet")))

# Train NearestCentroid without threshold
print('=' * 80)
print("NearestCentroid (aka Rocchio classifier)")
results.append(benchmark(NearestCentroid()))

# Train sparse Naive Bayes classifiers
print('=' * 80)
print("Naive Bayes")
results.append(benchmark(MultinomialNB(alpha=.01)))
results.append(benchmark(BernoulliNB(alpha=.01)))

print('=' * 80)
print("LinearSVC with L1-based feature selection")
# The smaller C, the stronger the regularization.
# The more regularization, the more sparsity.
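from sklearn.feature_selection import SelectFromModel
# SelectFromModel keeps the features with non-zero weights in the L1-penalized model
results.append(benchmark(Pipeline([
  ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False, tol=1e-3))),
  ('classification', LinearSVC())
])))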

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
def build_text_features(X_train, X_test, y_train, y_test):
    """Grid search the TF-IDF vocabulary size and the MultinomialNB smoothing on raw text."""
    for max_features in [1000, 5000, 10000]:
        vectorizer = TfidfVectorizer(lowercase=True, stop_words='english',
                                     max_features=max_features, ngram_range=(1, 3))

        train_tf_idf = vectorizer.fit_transform(X_train)
        test_tf_idf = vectorizer.transform(X_test)

        for alpha in [1.0, 0.5, 0.1, 1e-09, 0.0]:
            # initiate model as per grid params
            desc_nb_model = MultinomialNB(alpha=alpha)

            desc_nb_model.fit(train_tf_idf, y_train)

            print('accuracy: {}, alpha: {}, max_features: {}'.format(
                desc_nb_model.score(test_tf_idf, y_test), alpha, max_features))

In [None]:
vectorizer.get_feature_names()

In [None]:
def grid_search_nlp(X_train, X_test, y_train, y_test, textual_data='desc_init'):
    """Grid search TfIdf vectorizer and Multinomial NB for best accuracy on text data."""
    print(textual_data)
    for max_features in [30000, 20000, 10000]:
        # initiate vectorizer as per grid params
        desc_vect = TfidfVectorizer(input='content', lowercase=True, tokenizer=None,
                                    stop_words='english', use_idf=True,
                                    max_features=max_features, ngram_range=(1, 3))
        desc_tfidf_train = desc_vect.fit_transform(X_train[textual_data])
        desc_tfidf_test = desc_vect.transform(X_test[textual_data])

        for alpha in [1.0, 0.5, 0.1, 1e-09, 0.0]:
            # initiate model as per grid params
            desc_nb_model = MultinomialNB(alpha=alpha)

            desc_nb_model.fit(desc_tfidf_train, y_train)

            print('accuracy: {}, alpha: {}, max_features: {}'.format(desc_nb_model.score(desc_tfidf_test, y_test), alpha, max_features))

Switch to GraphLab Create.

In [None]:
import graphlab as gl
gl.canvas.set_target('ipynb')

In [None]:
apparel = gl.SFrame(apparel_category)
house = gl.SFrame(house_category)
phones = gl.SFrame(phones_category)

As the last step, I'm going to use graphlab.nearest_neighbors to get the 5 nearest offers to display to the user and take their median price as the recommendation.
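
The recommendation rule can be sketched without GraphLab. The example below assumes each offer already has a numeric feature vector (standing in for the image 'deep_features') and uses Euclidean distance. The helper name `suggest_price` and the toy offers are made up for illustration.

```python
import math


def suggest_price(query_features, offers, k=5):
    """Median price of the k offers whose feature vectors are closest to the query."""
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    nearest = sorted(offers, key=lambda o: distance(query_features, o["features"]))[:k]
    prices = sorted(o["price"] for o in nearest)
    middle = len(prices) // 2
    if len(prices) % 2:
        return prices[middle]
    return (prices[middle - 1] + prices[middle]) / 2.0


offers = [
    {"id": 1, "features": [0.0, 0.0], "price": 100},
    {"id": 2, "features": [0.1, 0.0], "price": 120},
    {"id": 3, "features": [0.0, 0.2], "price": 90},
    {"id": 4, "features": [0.3, 0.3], "price": 400},
    {"id": 5, "features": [0.2, 0.1], "price": 110},
    {"id": 6, "features": [5.0, 5.0], "price": 10},
]
print(suggest_price([0.0, 0.0], offers))  # 110
```

The median is used rather than the mean so that a single overpriced neighbor (the 400 offer here) does not pull the suggestion up. The sketch assumes at least one offer is available.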

In [None]:
gramms = ['new', 'used', 'unlocked', 'good condition', 'great condition', 'very good condition', 'never used']

Start with phones

In [None]:
phones['title_word_count'] = gl.text_analytics.tf_idf(phones['title'])
#phones['desc_word_count'] = gl.text_analytics.count_words(phones['description'])

In [None]:
pl = gl.text_analytics.count_ngrams(phones['description'], 2)
c = Counter()
for row in pl: 
    for key, value in row.iteritems():
        c[key] += value
c.most_common(50)

In [None]:
pl = gl.text_analytics.count_ngrams(apparel['title'], 2)
c = Counter()
for row in pl: 
    for key, value in row.iteritems():
        c[key] += value
c.most_common(30)

In [None]:
pl[0].keys()

In [None]:
# 'deep_features' must already be extracted from the item images and stored as a column of each SFrame
apparel_model = gl.nearest_neighbors.create(apparel, features=['deep_features'], label='id')
house_model = gl.nearest_neighbors.create(house, features=['deep_features'], label='id')
phone_model = gl.nearest_neighbors.create(phones, features=['deep_features'], label='id')

Add additional features for each category from title and description.

In [None]:
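# Draft feature-extraction snippet. It refers to columns ('name', 'org_name',
# 'payee_name', 'org_desc') that are not in items.json and to a tfidf_file path
# that is not defined in this notebook, so it must be adapted before it can run here.
from bs4 import BeautifulSoup
import joblib

## TF/IDF
#vectorizer1 = TfidfVectorizer(encoding='english',
#                            stop_words='english',
#                            strip_accents="ascii",
#                          # token_pattern=r'\w{3,}',
#                           max_features=100)

# strip HTML from the description and join it with the other text columns
text_vec = df['description'].apply(lambda x: BeautifulSoup(x, 'lxml').get_text())
text_vec1 = zip(text_vec, df['name'], df['org_name'], df['payee_name'], df['org_desc'])
text_vec1 = [ ''.join(ln) for ln in text_vec1]
count_char = pd.Series(text_vec1)
# simple style features: number of exclamation marks and of capital letters
df["Numberof!"]    = count_char.apply(lambda x: x.count("!"))
df["NumberofCaps"] = count_char.apply(lambda x: sum(1 for c in x if c.isupper()))

# load a previously fitted TF-IDF vectorizer and turn its output into named columns
tfidf_vec = joblib.load(tfidf_file)

r       = tfidf_vec.transform(text_vec1)
columns = tfidf_vec.get_feature_names()
columns = [ 'tfidf_'+c for c in columns]
temp    = pd.DataFrame(r.toarray(),columns=columns)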
