# Moscow tutors' price prediction

In [1]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [1]:
import pandas as pd
import re
import numpy as np
from ast import literal_eval
import fuzzywuzzy
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

In [1]:
tutors_data = pd.read_csv('../input/moscow-tutors/tutors_eng_2021_10_06/tutors_eng_2021_10_06.csv')

In [1]:
tutors_data.shape

In [1]:
tutors_data.head()

Features description:

*   **Categories** - List of subjects taught, out of 27 subjects
*   **Price** - Price in rub per hour
*   **Score** - Average score based on the reviews, 0.0-5.0
*   **Format** - List of working formats. Options: remotely, at the tutor's place, at the student's place
*   **Reviews_number** - Number of student reviews in the tutor's profile
*   **Experience** - Experience in years
*   **Status** - Current tutor's status. Options: Private tutor, School teacher, Postgraduate student, Native speaker, University professor, Student, not stated (missing)
*   **Location** - Metro stations or cities in the Moscow region
*   **Tags** - Tutor's services. Tutors write these themselves, so they vary. They are left in Russian in the dataset
*   **Audience** - Tutor's target audience, for example students or pupils of the 10th grade. Tutors write these themselves, so they vary
*   **Video_presentation** - Video presentation availability
*   **Photo** - Profile photo availability

## Dataset Analysis
---

In [1]:
tutors_data.isnull().sum()

Almost 74 % of the data is missing or empty in the last 8 columns. Let's look at the rows with missing values more closely.
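A percentage like the one above can be read off directly instead of being derived from raw counts. The sketch below is only an illustration on a made-up toy frame (the `toy` name and its values are assumptions, not the real dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for tutors_data (values are made up for illustration).
toy = pd.DataFrame({
    "Price": [1000, 1500, 800, 1200],
    "Experience": [5.0, np.nan, np.nan, np.nan],
    "Status": ["Student", None, None, "Private tutor"],
})

# isnull().mean() is the fraction of missing cells per column; scale it to percent.
missing_share = toy.isnull().mean().mul(100).round(1)
print(missing_share)
```

Applied to the real frame, the same expression would be `tutors_data.isnull().mean() * 100`.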

In [1]:
cols_with_missings = ['Reviews_number', 'Experience', 'Status', 'Location', 'Tags', 'Audience', 'Video_presentation', 'Photo']
# Take a copy so that the column assignments below do not trigger SettingWithCopyWarning.
null_data = tutors_data[tutors_data.isnull().any(axis=1)].copy()
null_data

In [1]:
null_data['Format'] = null_data['Format'].apply(lambda s: str(s).replace("at the tutor's","tutors place"))
null_data['Format'] = null_data['Format'].apply(lambda s: str(s).replace("at the student's","students place"))

In [1]:
for i in ['Categories', 'Format']:
    # NaN never equals itself, so `s != np.nan` is always True; pd.notna is the reliable check.
    null_data[i] = null_data[i].apply(lambda s: list(literal_eval(str(s))) if pd.notna(s) else s)

In [1]:
null_expl = null_data.explode('Format')
null_expl

Columns:

    
*   **`Reviews_number`**: There was no data on the page during parsing, so we assume these tutors have no student reviews at all, even though they have a score. We can fill the missing values with 0.
*   **`Experience`**: Tutors didn't indicate their experience, so we can assume that they have little or none. Let's fill the missing values with 0.
*   **`Status`**: Tutors didn't state their status. We'll fill the missing values with '`-`', which means '`No status`'.
*   **`Location`**: 75 % of the format entries for tutors with a missing location are `remote`. Let's assume the rest of the tutors are very mobile and can easily reach any place in Moscow (Heh, I'd like to see that). So we'll fill the missing cells with '`[]`', denoting all locations in Moscow.
*   **`Tags`**: The tags are written by the tutors and take lots of different values. We won't use this feature for prediction. Just fill in '`[]`'.
*   **`Audience`**: Assume that the tutors work with every kind of audience, so fill in '`['All']`'.
*   **`Video_presentation`** and **`Photo`**: There was no data about these parameters on the parsed pages, so we'll fill the missing values with '`No`'.
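The share of remote work among tutors without a location can also be computed in one step with normalised value counts. This is a minimal sketch on invented data; the `toy` frame and the location name in it are assumptions made for the example:

```python
import numpy as np
import pandas as pd

# Toy data: 'Format' is already a list per tutor, 'Location' has gaps.
toy = pd.DataFrame({
    "Format": [["remotely"], ["remotely", "tutors place"], ["students place"], ["remotely"]],
    "Location": [np.nan, np.nan, np.nan, "['Arbat']"],
})

missing_location = toy[toy["Location"].isnull()]
# One row per (tutor, format) pair, then relative frequencies in percent.
format_share = (missing_location["Format"].explode()
                .value_counts(normalize=True).mul(100))
print(format_share)
```

Note that the percentages are shares of exploded format entries, not of tutors, because a tutor with two formats contributes two rows.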

In [1]:
null_expl.Format.value_counts()

In [1]:
49019 / 65420 * 100

## Data preprocessing
---
Let's start by filling in the missing values.

In [1]:
tutors_data_preprocessed = tutors_data.copy()

In [1]:
tutors_data_preprocessed[['Reviews_number', 'Experience']] = tutors_data_preprocessed[['Reviews_number', 'Experience']].fillna(0)
tutors_data_preprocessed['Status'] = tutors_data_preprocessed['Status'].fillna('-')
tutors_data_preprocessed[['Location', 'Tags']] = tutors_data_preprocessed[['Location', 'Tags']].fillna('[]')
tutors_data_preprocessed['Audience'] = tutors_data_preprocessed['Audience'].fillna("['All']")
tutors_data_preprocessed[['Video_presentation', 'Photo']] = tutors_data_preprocessed[['Video_presentation', 'Photo']].fillna('No')

In [1]:
tutors_data_preprocessed.isnull().sum().sum()

After reading from the CSV file, the list-valued columns are stored as strings. For further analysis we need to convert these values back to lists.
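The conversion can be tried out on a small series first. The sketch below uses made-up strings (the `raw` series is an assumption for illustration) and also shows why missing cells should be detected with `pd.notna` rather than by comparing against `np.nan`:

```python
from ast import literal_eval

import numpy as np
import pandas as pd

# Made-up cells as they might look right after pd.read_csv.
raw = pd.Series(["['remotely', 'tutors place']", "[]", np.nan])

# NaN never compares equal to anything, itself included, so an equality or
# inequality test against np.nan cannot tell missing cells apart.
nan_differs_from_itself = np.nan != np.nan

# Parse only real strings; leave missing cells untouched.
parsed = raw.apply(lambda s: literal_eval(s) if pd.notna(s) else s)
print(parsed.tolist())
```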

In [1]:
cols_with_lists = ['Categories', 'Format', 'Location', 'Tags', 'Audience']

In [1]:
# Shorten the format labels so that they no longer contain apostrophes.
tutors_data_preprocessed['Format'] = tutors_data_preprocessed['Format'].apply(lambda s: str(s).replace("at the tutor's", "tutors place"))
tutors_data_preprocessed['Format'] = tutors_data_preprocessed['Format'].apply(lambda s: str(s).replace("at the student's", "students place"))
# Switch the list delimiters in 'Location' from single to double quotes.
tutors_data_preprocessed['Location'] = tutors_data_preprocessed['Location'].apply(lambda s: str(s).replace("['", '["'))
tutors_data_preprocessed['Location'] = tutors_data_preprocessed['Location'].apply(lambda s: str(s).replace("']", '"]'))
tutors_data_preprocessed['Location'] = tutors_data_preprocessed['Location'].apply(lambda s: str(s).replace("', '", '", "'))

In [1]:
for i in cols_with_lists:
    # NaN never equals itself, so `s != np.nan` is always True; pd.notna is the reliable check.
    tutors_data_preprocessed[i] = tutors_data_preprocessed[i].apply(lambda s: list(literal_eval(str(s))) if pd.notna(s) else s)

In [1]:
type(tutors_data_preprocessed.loc[0, 'Format'])

In [1]:
tutors_data_preprocessed.head()

Let's apply one-hot encoding for '**`Categories`**', '**`Format`**' and '**`Audience`**'.

In [1]:
categories_series = tutors_data_preprocessed['Categories'].explode()
categories_series.unique()

In [1]:
tutors_data_preprocessed = tutors_data_preprocessed.join(pd.crosstab(categories_series.index, categories_series))
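How `explode` and `pd.crosstab` combine into a one-hot encoding is easier to see on a tiny frame. The sketch below uses invented rows (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "Price": [1000, 1500, 800],
    "Categories": [["English"], ["Mathematics", "Physics"], ["English", "Mathematics"]],
})

# explode() repeats the row index once per list element.
exploded = toy["Categories"].explode()
# crosstab counts how often each subject occurs for each original row index.
indicators = pd.crosstab(exploded.index, exploded)
encoded = toy.join(indicators)
print(encoded)
```

Because `crosstab` counts occurrences, a label that appears twice in one row's list would produce a 2 rather than a 1; clipping the result with `.clip(upper=1)` is one way to guard against that.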

In [1]:
categories_list = tutors_data_preprocessed.columns[12:39]
tutors_data_preprocessed[categories_list].sum()

The most popular tutoring category is 'English' with 31 390 tutors on '*repetit.ru*'. 'Mathematics' comes second with 18 629 tutors and 'Russian' third with 11 751 tutors. Does the current generation of Russian kids have problems with their native language? Not necessarily: they are most likely preparing for the compulsory exam 'ЕГЭ' (the Unified State Exam; Russian: Единый государственный экзамен, ЕГЭ; Yediniy gosudarstvenniy ekzamen, EGE). Not surprising. Let's move on.

In [1]:
format_series = tutors_data_preprocessed['Format'].explode()
format_series.unique()

In [1]:
tutors_data_preprocessed = tutors_data_preprocessed.join(pd.crosstab(format_series.index, format_series))

Now let's look closely at '**`Audience`**'. It contains lots of similar entries, so we need to shorten the list of unique values. We'll merge these similar values into the following general groups:

*   All
*   Adults
*   Students
*   Pupils of 10-11 grades (Russian schools have 11 grades)
*   Pupils of 5-9 grades
*   Pupils of 1-4 grades
*   Children 6-7 years old
*   Children 4-5 years old
*   Children 1-3 years old

In [1]:
audience_series = tutors_data_preprocessed['Audience'].explode()
audience_list = audience_series.unique()
audience_list

A function that replaces similar values in a list column with a single label:

In [1]:
def replace_matches_in_column(df, col, string_to_match, replacing, min_ratio=50):
    """Replace entries of list column `col` that fuzzily match `string_to_match` with `replacing`."""
    # One row per list element; the original row index is repeated.
    exploded = df[col].explode()
    strings = exploded.unique()

    # Up to 30 best candidates as (candidate, score) pairs, scored on sorted tokens.
    matches = process.extract(string_to_match, strings,
                              limit=30, scorer=fuzz.token_sort_ratio)
    close_matches = [m[0] for m in matches if m[1] >= min_ratio]

    # Overwrite only the matching entries of this column (not whole rows),
    # then collect the entries back into one list per original row.
    exploded.loc[exploded.isin(close_matches)] = replacing
    df[col] = exploded.groupby(level=0).agg(list)
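The matching idea behind the function above can be illustrated without the fuzzy-matching dependency. The sketch below swaps in the standard library's `difflib.SequenceMatcher` as a stand-in, so its scores are not identical to `fuzzywuzzy`'s `token_sort_ratio`; the helper name `token_sort_similarity` and the sample labels are assumptions made for this example:

```python
from difflib import SequenceMatcher


def token_sort_similarity(a: str, b: str) -> float:
    """Rough 0-100 similarity that ignores word order and letter case.

    Stand-in for illustration only; not fuzzywuzzy's exact scoring.
    """
    a_sorted = " ".join(sorted(a.lower().split()))
    b_sorted = " ".join(sorted(b.lower().split()))
    return 100 * SequenceMatcher(None, a_sorted, b_sorted).ratio()


# Hypothetical audience labels, made up for this example.
labels = ["Pupils of 10 grade", "grade 10 pupils", "Adults"]
target = "Pupils of 10 grade"

# Keep every label whose similarity to the target reaches the threshold.
close = [s for s in labels if token_sort_similarity(target, s) >= 70]
print(close)
```

The threshold plays the same role as `min_ratio`: the lower it is, the more loosely related labels get merged, which is why very low values need to be checked against the resulting list of unique labels.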

Replacing

In [1]:
for i in ['11', '10']:
    replace_matches_in_column(df=tutors_data_preprocessed, col='Audience', string_to_match=i, replacing='Pupils of 10-11 grades', min_ratio=17)
for i in ['9', '8', '7', '6', '5']:
    replace_matches_in_column(df=tutors_data_preprocessed, col='Audience', string_to_match=i, replacing='Pupils of 5-9 grades', min_ratio=10)
replace_matches_in_column(df=tutors_data_preprocessed, col='Audience', string_to_match='4', replacing='Pupils of 1-4 grades', min_ratio=10)
for i in ['3', '2']:
    replace_matches_in_column(df=tutors_data_preprocessed, col='Audience', string_to_match='Pupils of ' + i, replacing='Pupils of 1-4 grades', min_ratio=71)
replace_matches_in_column(df=tutors_data_preprocessed, col='Audience', string_to_match='Children 1-3 года', replacing='Children 1-3 years old', min_ratio=100)

Check '**`Audience`**'

In [1]:
audience_series = tutors_data_preprocessed['Audience'].explode()
audience_list = audience_series.unique()
audience_list

In [1]:
tutors_data_preprocessed = tutors_data_preprocessed.join(pd.crosstab(audience_series.index, audience_series))

Let's drop the list-valued features now that they are encoded. We are also dropping the unencoded '**`Location`**': because of the COVID-19 situation, we assume most people prefer to work remotely in order not to get infected.

In [1]:
tutors_data_preprocessed = tutors_data_preprocessed.drop(cols_with_lists, axis=1)

In [1]:
tutors_data_preprocessed.head()

It only remains to encode '**`Status`**', '**`Video_presentation`**' and '**`Photo`**' columns.

In [1]:
status_coding = {'Private tutor': '4', 'School teacher': '2', 'Postgraduate student': '3',
       'Native speaker': '6', 'University professor': '5', 'Student': '1', '-': '0'}

In [1]:
# Look each status up in the mapping; a label missing from status_coding becomes NaN.
tutors_data_preprocessed['Status'] = tutors_data_preprocessed['Status'].map(status_coding)
for i in ['Video_presentation', 'Photo']:
    tutors_data_preprocessed[i] = (tutors_data_preprocessed[i] == 'Yes').astype('int64')
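Encoding labels through a dictionary lookup can be checked on a small example. The sketch below uses a subset of the status mapping above with integer codes, plus one deliberately unknown label; the `status` series is made up for illustration:

```python
import pandas as pd

# Subset of the status mapping, with integer codes for brevity.
status_codes = {"Student": 1, "School teacher": 2, "-": 0}
status = pd.Series(["Student", "-", "School teacher", "Unknown label"])

encoded = status.map(status_codes)
# Labels that are missing from the dictionary become NaN, so they are easy to find.
unmapped = status[encoded.isna()].unique()
print(encoded.tolist(), unmapped)
```

Listing the unmapped labels before casting to an integer type helps to catch typos or new status values early.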

In [1]:
tutors_data_preprocessed = tutors_data_preprocessed.astype(
    {'Status': 'int64', 'Reviews_number': 'int64', 'Experience': 'int64'})

In [1]:
tutors_data_preprocessed.head()

## Building a model
---

In [1]:
tutors_data_preprocessed.Score.value_counts()

We can guess that a student or a student's parents won't choose a tutor with a score below 4.0, so we keep only tutors with a score of at least 4.0. After filtering by score we'll drop this feature to avoid target leakage. '**`Price`**' is the target, so we'll drop it from X too.

In [1]:
X = tutors_data_preprocessed[tutors_data_preprocessed['Score'] >= 4.0].drop(['Price', 'Score'], axis=1)
X.head()

In [1]:
X.shape

In [1]:
y = tutors_data_preprocessed[tutors_data_preprocessed['Score'] >= 4.0].Price
y.head()

In [1]:
y.shape

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor

In [1]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [1]:
X_train.shape

In [1]:
y_train.shape

XGBRegressor

In [1]:
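my_model_1 = XGBRegressor(n_estimators=500, learning_rate=0.05, n_jobs=4, max_depth=5)
# Early stopping monitors the test split here, so the MAE reported below is slightly optimistic.
# Note: xgboost 2.0+ expects early_stopping_rounds in the XGBRegressor constructor instead of fit().
my_model_1.fit(X_train, y_train, early_stopping_rounds=5, eval_set=[(X_test, y_test)], verbose=False)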

In [1]:
preds_1 = my_model_1.predict(X_test)
# scikit-learn's convention is (y_true, y_pred); MAE itself is symmetric.
print("Mean Absolute Error: " + str(mean_absolute_error(y_test, preds_1)))
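For reference, the mean absolute error is simply the average of the absolute differences between true and predicted prices. A minimal sketch with made-up numbers (the arrays are assumptions for illustration):

```python
import numpy as np

y_true = np.array([1000, 1500, 800])   # made-up true prices, rub per hour
y_pred = np.array([1200, 1400, 900])   # made-up predictions

# Absolute errors are 200, 100 and 100; MAE is their mean.
mae = np.mean(np.abs(y_true - y_pred))
print(mae)
```

Since MAE is expressed in the target's own units, it can be read directly as "the prediction is off by this many rub per hour on average".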

RandomForestRegressor

In [1]:
my_model_2 = RandomForestRegressor(max_depth=10, max_leaf_nodes=100, n_jobs=4)
my_model_2.fit(X_train, y_train)

In [1]:
preds_2 = my_model_2.predict(X_test)
# scikit-learn's convention is (y_true, y_pred); MAE itself is symmetric.
print("Mean Absolute Error: " + str(mean_absolute_error(y_test, preds_2)))

Cross-validation

In [1]:
scores = -1 * cross_val_score(my_model_1, X, y, cv=5, scoring='neg_mean_absolute_error')

In [1]:
print(scores)
print("Average MAE score (across experiments): " + str(scores.mean()))
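What the five scores represent can be sketched by hand: the data is cut into five contiguous folds, each fold is held out once, and the MAE on the held-out part is recorded. The example below is an illustration only; it uses random made-up prices and a trivial stand-in "model" (the training median) instead of XGBoost:

```python
import numpy as np

rng = np.random.default_rng(0)
prices = rng.integers(500, 3000, size=20).astype(float)  # made-up targets

fold_maes = []
# Five contiguous folds without shuffling.
for test_idx in np.array_split(np.arange(len(prices)), 5):
    train_mask = np.ones(len(prices), dtype=bool)
    train_mask[test_idx] = False
    # Stand-in "model": always predict the median of the training part.
    prediction = np.median(prices[train_mask])
    fold_maes.append(np.mean(np.abs(prices[test_idx] - prediction)))

print(fold_maes)
print("Average MAE across folds:", np.mean(fold_maes))
```

The spread between the per-fold values gives a rough idea of how much a single train/test split can over- or understate the error.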

## Price estimations
---

Let's create a sample tutor profile for prediction.

In [1]:
data_for_predictions = pd.DataFrame(np.array([[0,7,5,0,1,
                                              0,0,0,0,0,
                                              0,0,0,0,0,0,
                                              1,0,0,0,1,0,
                                              0,1,0,0,0,0,
                                              0,0,0,0,1,0,
                                              0,0,0,0,0,0,
                                              1,1,1,1]]), 
                                    columns=X.columns)

In [1]:
data_for_predictions

Here we have a tutor with 7 years of experience, working as a university professor, who wants to independently teach math, informatics/programming and physics to pupils and students. The tutor has just joined '[repetit.ru](https://repetit.ru)' and doesn't have any reviews yet. He has a photo in his profile, maybe a description, and doesn't want to create a video. The tutor prefers to work remotely.
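A profile like this can also be built by column name rather than by position, which is less error-prone than counting zeros. The sketch below is an assumption-laden illustration: `feature_columns` is a hypothetical subset of the real feature list (which would come from `X.columns`), and the exact column names may differ in the actual data:

```python
import pandas as pd

# Hypothetical subset of the feature columns; the real list comes from X.columns.
feature_columns = ["Reviews_number", "Experience", "Status", "Video_presentation",
                   "Photo", "Mathematics", "Physics", "remotely"]

# Only the non-zero features of the described tutor (Status 5 = University professor).
profile = {"Experience": 7, "Status": 5, "Photo": 1,
           "Mathematics": 1, "Physics": 1, "remotely": 1}

# Every column that is not named in the profile is filled with 0.
row = pd.DataFrame([profile]).reindex(columns=feature_columns, fill_value=0)
print(row)
```

Reindexing against the full column list also guarantees that the columns arrive in the order the model was trained on.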

In [1]:
pred_price = my_model_1.predict(data_for_predictions)
pred_price

Thus the starting price for the tutor is approximately 1000 rub per hour. Taking into account that the MAE is approximately 300 rub, the tutor could charge up to 1300 rub per hour.
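The arithmetic behind that range is shown below, using the rounded figures quoted above. Keep in mind that MAE is an average error, so this is a rough band and not a guaranteed bound:

```python
predicted_price = 1000  # rub per hour, rounded model output quoted above
mae = 300               # rub, rounded mean absolute error quoted above

# A rough band of one MAE on either side of the prediction.
low, high = predicted_price - mae, predicted_price + mae
print(f"{low}-{high} rub per hour")
```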