## Forecasting with text mining

#### Task description:
***
The goal is to prepare a proof of concept (PoC) that predicts the price of used iPhone 11 phones. The data is web-scraped from an online marketplace, so it was written by many different users who provide inconsistent, unstructured information in natural language. It also contains plenty of noise, such as non-iPhone offers or several items sold in a single advertisement.  
The iPhone 11 also comes in several variants. The variant is not directly indicated in the dataset, so it has to be inferred from the description provided with each offer.  

#### Task approach:
***
- Understanding and analysis of the original dataset
- Dataset cleaning
- Text mining to categorize offers into types of iPhones
- Exploration of the cleaned dataset
- Preparation of forecasting

#### Data:
***
The data contains web-scraped offers from the online marketplace OLX, written in Polish by users of the portal. Each offer has details specified by its creator (mainly sellers), such as the offer title, description and item condition.  
The prices are expressed in PLN (1 EUR = ~4.5 PLN).

In [2]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import itertools
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_pacf
from statsmodels.graphics.tsaplots import plot_acf
import matplotlib.pyplot as plt
warnings.resetwarnings()
# import spacy
# nlp = spacy.load('pl_core_news_sm')

In [3]:
# Notebook params

warnings.filterwarnings('ignore', 'statsmodels.tsa.arima_model.ARMA', FutureWarning)
warnings.filterwarnings('ignore', 'statsmodels.tsa.arima_model.ARIMA', FutureWarning)

plt.style.use('ggplot')
plt.rcParams["figure.figsize"] = (15,10)
plt.rcParams.update({'font.size': 15})

In [5]:
## Data reading
org_data = pd.read_csv('data.csv')
print('Dataset shape:', org_data.shape)
org_data.head(5)

Dataset shape: (5084, 13)


| | http | voivodeship | scrap_time | name | price | brand | condition | offer_from | type | description | added_at | views | user_since |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.olx.pl/oferta/iphone-11-64-jak-now... | pomorskie | 2021-02-22 06:55:30 | Iphone 11 64 jak nowy 95% gwarancja wyświetlacz | 2799.0 | iPhone | Używane | Osoby prywatnej | Sprawny | Jak nowy . Kondycja baterii 95%. Kupiony w med... | 2021-02-22 00:09:00 | 37 | 2013-05-01 00:00:00 |
| 1 | https://www.olx.pl/oferta/skup-uszkodzonych-te... | pomorskie | 2021-02-22 06:55:34 | Skup uszkodzonych telefonów iPhone xs xs max 1... | | | | Firmy | | Witam. Kupię uszkodzone/ zablokowane/ zalane/... | 2021-02-22 00:05:00 | 5242 | 2020-04-01 00:00:00 |
| 2 | https://www.olx.pl/oferta/iphone-11-64-gb-czar... | pomorskie | 2021-02-22 06:55:40 | IPhone 11 64 GB czarny, idealny z gwarancją. W... | 2700.0 | iPhone | Używane | Osoby prywatnej | Sprawny | Witam! Mam na sprzedaż iPhone’a 11 w wersji 64... | 2021-02-21 19:00:00 | 186 | 2014-12-01 00:00:00 |
| 3 | https://www.olx.pl/oferta/iphone-11-CID99-IDIk... | pomorskie | 2021-02-22 06:55:44 | Iphone 11 | 3000.0 | iPhone | Nowe | Osoby prywatnej | Sprawny | Nowy 128GB Oryginalnie zapakowany kolor czar... | 2021-02-21 18:24:00 | 250 | 2016-06-01 00:00:00 |
| 4 | https://www.olx.pl/oferta/jak-nowy-apple-iphon... | pomorskie | 2021-02-22 06:55:52 | Jak Nowy Apple Iphone 11 256gbGB White Gwarancja | 2899.0 | iPhone | Używane | Firmy | Sprawny | Witaj. Jesteśmy sklepem - serwisem z 12 le... | 2021-02-21 17:38:00 | 845 | 2012-08-01 00:00:00 |


### Dataset exploration

In [8]:
print(org_data.describe())
org_data[org_data['price']>10000]

              price          views
count   4747.000000    5084.000000
mean    2564.713467     763.351692
std      900.714124    2722.697621
min      100.000000       1.000000
25%     2399.000000      68.000000
50%     2689.000000     139.000000
75%     2950.000000     476.000000
max    12345.000000  121752.000000


| | http | voivodeship | scrap_time | name | price | brand | condition | offer_from | type | description | added_at | views | user_since |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 609 | https://www.olx.pl/oferta/etui-samsung-a31-a51... | dolnoslaskie | 2021-01-08 00:02:17 | Etui samsung a31/a51 , iphone 11(pancer) | 12345.0 | | Nowe | Osoby prywatnej | | Etui Samsung-15 zł iPhone -30 zł Nówki sztu... | 2021-01-07 12:35:00 | 78 | 2020-11-01 00:00:00 |


The dataset contains some very small prices (below 1,000 PLN) as well as some very large ones (above 10,000 PLN). A used iPhone that is a few years old typically costs more than 1,000 PLN but less than 6,000 PLN, so values outside this range are implausible.

A review of these records showed that they are accessories (e.g. phone cases), other phones or bulk offers. They will be filtered out, as the focus is on the iPhone 11 only.
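The plausibility range described above could be checked along these lines. This is only a sketch: the helper `split_by_price`, the constants `PRICE_MIN` and `PRICE_MAX`, and the toy rows are assumptions made for illustration, and the bounds are treated as inclusive. The notebook itself filters on brand, condition and type rather than on price.

```python
import pandas as pd

# Hypothetical bounds taken from the text: plausible prices lie within 1000-6000 PLN.
PRICE_MIN, PRICE_MAX = 1000, 6000


def split_by_price(df, low=PRICE_MIN, high=PRICE_MAX):
    """Return (plausible, suspicious) offers based on the 'price' column."""
    # between() is inclusive on both ends; missing prices evaluate to False.
    in_range = df['price'].between(low, high)
    return df[in_range], df[~in_range]


# Toy rows for illustration only (not taken from the dataset).
sample = pd.DataFrame({
    'name': ['iPhone 11 64 GB', 'Etui iPhone 11', 'iPhone 11 Pro hurt', 'Skup telefonów'],
    'price': [2700.0, 30.0, 12345.0, None],
})

plausible, suspicious = split_by_price(sample)
print(suspicious[['name', 'price']])
```

Listing the suspicious rows rather than dropping them immediately makes it easier to review what kind of offers fall outside the range.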

In [21]:
# Missing values
# Count rows that contain at least one missing value
print('Unique rows with nulls:', org_data.isna().any(axis=1).sum())
org_data.isna().sum()

Unique rows with nulls: 677


http             0
voivodeship      0
scrap_time       0
name             0
price          337
brand          466
condition      131
offer_from       0
type           458
description      0
added_at         0
views            0
user_since       0
dtype: int64

Missing values occur in the price, brand, condition and type columns. There are around 1,400 missing values in total, but they appear in fewer than 700 rows, so many rows have more than one missing value.

This often happens for accessory offers and for bulk ads in which multiple phones or items are bought or sold. Such offers need to be filtered out. Relevant information could be extracted from the individual descriptions.
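The claim that missing values cluster in the same rows can be checked by counting nulls per row. The sketch below uses a small toy frame (an assumption, for illustration only); in the notebook the same calls would be applied to `org_data`.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; in the notebook this would be org_data.
toy = pd.DataFrame({
    'price': [2799.0, np.nan, 2700.0, np.nan],
    'brand': ['iPhone', np.nan, 'iPhone', np.nan],
    'condition': ['Używane', np.nan, 'Używane', 'Nowe'],
    'type': ['Sprawny', np.nan, 'Sprawny', np.nan],
})

# Number of missing fields in each row
nulls_per_row = toy.isna().sum(axis=1)

rows_with_nulls = int((nulls_per_row > 0).sum())
total_nulls = int(nulls_per_row.sum())

# How many rows have 0, 1, 2, ... missing fields
distribution = nulls_per_row.value_counts().sort_index()

print('Rows with nulls:', rows_with_nulls)
print('Total nulls:', total_nulls)
print(distribution)
```

If most affected rows have several missing fields at once, that supports treating them as irrelevant offers rather than imputing individual values.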

In [22]:
# Checking unique values of key columns
print('Brand types:', org_data.brand.unique())
print('Condition types:', org_data.condition.unique())
print(org_data.type.unique())

Brand types: ['iPhone' nan 'Inne telefony gsm' 'Samsung' 'LG']
Condition types: ['Używane' nan 'Nowe']
['Sprawny' nan 'Uszkodzony' 'Męskie']


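Apart from used iPhones, the dataset also contains new phones and phones from other manufacturers. The Polish labels translate as follows: 'Używane' means used, 'Nowe' means new, 'Sprawny' means working, 'Uszkodzony' means damaged, 'Męskie' means men's, and 'Inne telefony gsm' means other GSM phones.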

For this task we are interested in just used iPhones.

In [24]:
# Keep only used, working iPhones that have a price
mask = (
    (org_data.brand == 'iPhone')
    & (org_data.condition == 'Używane')
    & (org_data.type == 'Sprawny')
    & org_data.price.notna()
)
data = org_data[mask].copy()

# Keep only the date part (YYYY-MM-DD) of the timestamp and the relevant columns
data['date'] = data['added_at'].str[:10]
data = data[['name', 'price', 'brand', 'condition', 'type', 'description', 'date']]

print(data.describe())

             price
count  2754.000000
mean   2635.539887
std     612.208260
min     200.000000
25%    2350.000000
50%    2550.000000
75%    2900.000000
max    6000.000000


Of the 5,084 records, 2,330 were not relevant. The dataset is still not clean, however, as other iPhone models remain (e.g. a 5s priced at 200 PLN).

### Flagging models of phones

In [28]:
# Sample review of the name and description fields; they hold
# the relevant information: model type, storage etc.
for field in ['name', 'description']:
    for obs in data[field][:2]:
        print(obs, '\n')

Iphone 11 64 jak nowy 95% gwarancja wyświetlacz 

IPhone 11 64 GB czarny, idealny z gwarancją. Wymiana 

Jak nowy . Kondycja baterii 95%. Kupiony w media markt . Posiadam faktury . Dodatkowa gwarancja na zbity wyświetlacz wartość 600zł . Dodatkowo szkło hartowane 5D oraz pokrowiec SPIGEN wartość 80zl. Nie sprzedaje za granice !! . Polecam 

Witam! Mam na sprzedaż iPhone’a 11 w wersji 64 GB. Telefon jest w stanie idealnym, wręcz jak nowym, ani jednej rysy ma przedzie l, rantach czy tyle. Posiada gwarancje do 17 września 2021 roku. Zakupiony w sieci play przez pierwszego właściciela, odkupiłem telefon jako nowy i wykonałem skany dokumentów osobiście. Zdjęć telefonu nie zamieściłem z powodu braku drugiego telefonu, jednakże wyglada on jak nowy. Założone jest szkło hartowane, lekko pęknięte chyba od kluczy w kieszeni natomiast ubrany jest w siwe etui Apple. Umieszczam screen gwarancji. Wszystkie dokumenty z pudełka i słuchawki nowe, nigdy nie wyjmowane. Bateria kondycja 100% 



For many NLP tasks, a good approach is to use pretrained models. One popular, well-performing library is spaCy, which offers a ready-to-use Polish model (pl_core_news_sm). This is especially handy for semantic analysis, but also for other tasks that require natural language processing.

For iPhones, however, the task can be solved more simply. There are only a few variants, so a sophisticated model is not strictly required: a plain text-mining approach can give similar results. This is especially true for a PoC.

Each iPhone model usually comes in a few variants. For the iPhone 11, these are the standard model (11), the 11 Pro and the 11 Pro Max, with storage being one of [64, 128, 256, 512] GB. As this information is crucial when purchasing such a phone, it should be specified in the title or description of each offer.
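A minimal sketch of how such a text-mining step might look, assuming simple regular expressions are enough. The names `MODEL_RE`, `STORAGE_RE` and `extract_model_and_storage` are hypothetical and not part of the original notebook. The patterns only look for the number 11 and the known storage sizes, so unrelated numbers in a description (for example a date containing 11) can cause false matches.

```python
import re

# "11" not preceded by a digit, optionally followed by "pro" and then "max".
MODEL_RE = re.compile(r'(?<!\d)11(?:\s*(pro)(?:\s*(max))?)?\b', re.IGNORECASE)
# One of the known capacities, either followed by "gb" or ending at a word boundary.
STORAGE_RE = re.compile(r'\b(64|128|256|512)(?:\s*gb|\b)', re.IGNORECASE)


def extract_model_and_storage(text):
    """Return (model, storage_gb); either element is None when not found."""
    model = None
    model_match = MODEL_RE.search(text)
    if model_match:
        # Groups hold "pro" and "max" when present, in any letter case.
        suffixes = [g.capitalize() for g in model_match.groups() if g]
        model = ' '.join(['11'] + suffixes)

    storage_match = STORAGE_RE.search(text)
    storage = int(storage_match.group(1)) if storage_match else None
    return model, storage


# Two titles from the dataset preview plus two made-up ones.
titles = [
    'Iphone 11 64 jak nowy 95% gwarancja wyświetlacz',
    'Jak Nowy Apple Iphone 11 256gbGB White Gwarancja',
    'iPhone 11 Pro Max 512GB',
    'Etui samsung a31/a51',
]
for title in titles:
    print(title, '->', extract_model_and_storage(title))
```

In the notebook, the function could be applied to the title first and to the description only when the title yields no match, since titles are shorter and less likely to contain unrelated numbers.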