# Dataset Generation for Trend Analysis

This notebook contains the steps to create a dataset that holds the initial chefkoch dataset from October 2017 and the data changes that occurred over the following months until March 2018. To simplify the data preparation, this analysis ignores recipes that were published after the October 2017 snapshot.

To create such a dataset, we first need to load all six data files and add the relevant columns to a single dataset. During this process, we keep only those columns that actually change over time.
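The overall idea can be sketched on toy data: each later snapshot contributes its changing columns under a month suffix. This is only a minimal sketch; the frames, values and the `snapshots` dictionary are invented for illustration and are not taken from the chefkoch files.

```python
import pandas as pd

# Toy base snapshot (hypothetical values)
base = pd.DataFrame({'name': ['A', 'B'], 'user': ['u1', 'u2'], 'saved': [10, 20]})

# Later snapshots, keyed by the month suffix used for their columns
snapshots = {
    '11': pd.DataFrame({'name': ['A', 'B'], 'user': ['u1', 'u2'], 'saved': [12, 21]}),
    '12': pd.DataFrame({'name': ['A', 'B'], 'user': ['u1', 'u2'], 'saved': [15, 21]}),
}

combined = base
for suffix, snap in snapshots.items():
    # Columns of the base frame keep their name, new columns get the suffix
    combined = pd.merge(combined, snap, how='left', on=['name', 'user'],
                        suffixes=('', '_' + suffix))

print(list(combined.columns))
```

Each pass through the loop adds one suffixed column per changing attribute, so the base values and all later values end up side by side in one row per recipe.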

In [1]:
import json
import pandas as pd

with open('data/chefkoch_10.json') as data_file:    
    chef10 = json.load(data_file)

In [2]:
def clean_dataset(data):
    date = []
    errors = []
    for i, r in enumerate(data):
        date = data[i]['date'].split('.')
        data[i]['printed'] = data[i]['printed'].split('(')[0].replace('.','')
        data[i]['saved'] = data[i]['saved'].split('(')[0].replace('.','')
        data[i]['shared'] = data[i]['shared'].split('(')[0].replace('.','')

        # For all complete recipes, produce date features
        try:
            data[i]['day'] = int(date[0])
            data[i]['month'] = int(date[1])
            if int(date[2]) < 20:
                data[i]['year'] = 2000+int(date[2])
                data[i]['date'] = date[0]+"."+date[1]+"."+str(data[i]['year'])
            else:
                data[i]['year'] = int(date[2])

        # For all others, mark the date as missing and set the counters to 0
        except ValueError:
            errors.append(i)
            data[i]['date'] = None
            data[i]['printed'] = 0
            data[i]['saved'] = 0
            data[i]['shared'] = 0
            data[i]['number_ratings'] = 0

    return (len(errors))
    
clean_dataset(chef10)

96493

In [3]:
# Original number of recipes:
len(chef10)

313751

In [28]:
chefdf = pd.DataFrame(chef10)
chefdf = chefdf.dropna(subset=['date'])
chefdf['date'] = pd.to_datetime(chefdf['date'], format="%d.%m.%Y")
chefdf['avg_rating'] = pd.to_numeric(chefdf['avg_rating'].str.replace('Ø','').str.replace(',','.'))

numerics = ['avg_rating','printed', 'saved', 'shared', 'number_ratings']
for c in numerics:  # convert numeric attributes
        chefdf[c] = pd.to_numeric(chefdf[c])

chefdf.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 217258 entries, 0 to 313750
Data columns (total 21 columns):
avg_rating        217258 non-null float64
calories          217258 non-null object
date              217258 non-null datetime64[ns]
day               217258 non-null float64
difficulty        217258 non-null object
ingredients       217258 non-null object
month             217258 non-null float64
name              217258 non-null object
number_ratings    217258 non-null int64
preparation       217258 non-null object
printed           217258 non-null int64
saved             217258 non-null int64
shared            217258 non-null int64
subtitle          113218 non-null object
time_cook         217258 non-null object
time_rest         217258 non-null object
time_work         217258 non-null object
user              217258 non-null object
user_activity     217258 non-null object
user_date         217258 non-null object
year              217258 non-null float64
dtypes: datetime64[n

In [29]:
# Drop dirty attributes that do not provide any value
# (errors='ignore' keeps the cell re-runnable if the columns are already gone)
chefdf = chefdf.drop(['calories', 'difficulty', 'ingredients', 
           'preparation', 'subtitle', 'time_cook', 'time_rest', 
           'time_work', 'user_activity', 'user_date'], axis=1, errors='ignore')

chefdf.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 217258 entries, 0 to 313750
Data columns (total 11 columns):
avg_rating        217258 non-null float64
date              217258 non-null datetime64[ns]
day               217258 non-null float64
month             217258 non-null float64
name              217258 non-null object
number_ratings    217258 non-null int64
printed           217258 non-null int64
saved             217258 non-null int64
shared            217258 non-null int64
user              217258 non-null object
year              217258 non-null float64
dtypes: datetime64[ns](1), float64(4), int64(4), object(2)
memory usage: 19.9+ MB


This initial DataFrame holds all recipes that have not been affected by the scraper's implementation error.
Most of the dirty attributes have been dropped because they proved to carry little useful information.
These preprocessing steps can be collected in one function that is reused for each dataset.

When joining the newer datasets with the old one, we need a unique index to find the correct data points to join.
For this purpose, the combination of the attributes name and user will be used with the assumption that one user would not publish two recipes with exactly the same name.
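Whether this assumption holds can be checked before joining. The following is a minimal sketch on invented toy data (the `recipes` frame and its values are hypothetical); it only uses `DataFrame.duplicated` and `Index.is_unique` from pandas.

```python
import pandas as pd

# Toy data (hypothetical); the second and third rows share the same key
recipes = pd.DataFrame({
    'name': ['Apfelkuchen', 'Pizzateig', 'Pizzateig'],
    'user': ['u1', 'u2', 'u2'],
    'saved': [5, 7, 8],
})

key = ['name', 'user']

# keep=False marks every member of a duplicated group, not just the repeats
dupes = recipes[recipes.duplicated(subset=key, keep=False)]
print(len(dupes), "rows share a (name, user) key")

# A quick yes/no check for the whole frame
print(recipes.set_index(key).index.is_unique)
```

If the check reports duplicates, joining on this key will produce more rows than the base dataset contains.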

In [30]:
chefdf['date'].max()

Timestamp('2017-09-28 00:00:00')

In [31]:
max_date = chefdf['date'].max()
columns = ['calories', 'date', 'difficulty', 'ingredients', 
           'preparation', 'subtitle', 'time_cook', 'time_rest', 
           'time_work', 'user_activity', 'user_date']
numerics = ['avg_rating','printed', 'saved', 'shared', 'number_ratings']

def convert(data):
    df = pd.DataFrame(data)  # convert to DataFrame
    df = df.dropna(subset=['date'])  # Remove incomplete recipes
    df['date'] = pd.to_datetime(df['date'], format="%d.%m.%Y")  # convert date
    df['avg_rating'] = pd.to_numeric(df['avg_rating'].str.replace('Ø','').str.replace(',','.'))  # convert rating
    df = df.loc[df['date'] <= max_date]  # filter newer recipes
    for c in numerics:  # convert numeric attributes
        df[c] = pd.to_numeric(df[c])
    df = df.drop(columns, axis=1)  # drop unnecessary columns
    return df

In [32]:
def clean_new(data):
    date = []
    errors = []
    for i, r in enumerate(data):
        date = data[i]['date'].split('.')
        data[i]['printed'] = data[i]['printed'].split('(')[0].replace('.','')
        data[i]['saved'] = data[i]['saved'].split('(')[0].replace('.','')
        data[i]['shared'] = data[i]['shared'].split('(')[0].replace('.','')

        # For all complete recipes convert printed to numeric
        try:
            data[i]['printed'] = int(data[i]['printed'])
            if int(date[2]) < 20:
                data[i]['date'] = date[0]+"."+date[1]+"."+str(2000+int(date[2]))
            else:
                pass 

        # If not possible, data is dirty
        except ValueError:
            errors.append(i)
            data[i]['date'] = None
            data[i]['printed'] = None
            data[i]['saved'] = None
            data[i]['shared'] = None
            data[i]['number_ratings'] = None

    return (len(errors))

In [33]:
with open('data/chefkoch_11.json') as data_file:    
    chef11 = json.load(data_file)
clean_new(chef11)
chef11df = convert(chef11)
chef11df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 218448 entries, 0 to 314892
Data columns (total 7 columns):
avg_rating        218448 non-null float64
name              218448 non-null object
number_ratings    218448 non-null int64
printed           218448 non-null float64
saved             218448 non-null int64
shared            218448 non-null int64
user              218448 non-null object
dtypes: float64(2), int64(3), object(2)
memory usage: 13.3+ MB


In [34]:
chefdf = pd.merge(chefdf, chef11df,  how='left', left_on=['name', 'user'], 
                  right_on = ['name', 'user'], suffixes=('','_11'))

In [35]:
with open('data/chefkoch_12.json') as data_file:    
    chef12 = json.load(data_file)
clean_new(chef12)
chef12df = convert(chef12)
chef12df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 219344 entries, 0 to 315916
Data columns (total 7 columns):
avg_rating        219344 non-null float64
name              219344 non-null object
number_ratings    219344 non-null int64
printed           219344 non-null float64
saved             219344 non-null int64
shared            219344 non-null int64
user              219344 non-null object
dtypes: float64(2), int64(3), object(2)
memory usage: 13.4+ MB


In [36]:
chefdf = pd.merge(chefdf, chef12df,  how='left', left_on=['name','user'], 
                  right_on = ['name','user'], suffixes=('','_12'))

In [37]:
with open('data/chefkoch_01.json') as data_file:    
    chef01 = json.load(data_file)
clean_new(chef01)
chef01df = convert(chef01)
chef01df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 220259 entries, 0 to 316871
Data columns (total 7 columns):
avg_rating        220259 non-null float64
name              220259 non-null object
number_ratings    220259 non-null int64
printed           220259 non-null float64
saved             220259 non-null int64
shared            220259 non-null int64
user              220259 non-null object
dtypes: float64(2), int64(3), object(2)
memory usage: 13.4+ MB


In [38]:
chefdf = pd.merge(chefdf, chef01df,  how='left', left_on=['name','user'], 
                  right_on = ['name','user'], suffixes=('','_01'))

In [39]:
with open('data/chefkoch_02.json') as data_file:    
    chef02 = json.load(data_file)
clean_new(chef02)
chef02df = convert(chef02)
chef02df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 221080 entries, 0 to 318239
Data columns (total 7 columns):
avg_rating        221080 non-null float64
name              221080 non-null object
number_ratings    221080 non-null int64
printed           221080 non-null float64
saved             221080 non-null int64
shared            221080 non-null int64
user              221080 non-null object
dtypes: float64(2), int64(3), object(2)
memory usage: 13.5+ MB


In [40]:
chefdf = pd.merge(chefdf, chef02df,  how='left', left_on=['name','user'], 
                  right_on = ['name','user'], suffixes=('','_02'))

In [41]:
with open('data/chefkoch_03.json') as data_file:    
    chef03 = json.load(data_file)
clean_new(chef03)
chef03df = convert(chef03)
chef03df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 221752 entries, 0 to 319577
Data columns (total 7 columns):
avg_rating        221752 non-null float64
name              221752 non-null object
number_ratings    221752 non-null int64
printed           221752 non-null float64
saved             221752 non-null int64
shared            221752 non-null int64
user              221752 non-null object
dtypes: float64(2), int64(3), object(2)
memory usage: 13.5+ MB


In [42]:
chefdf = pd.merge(chefdf, chef03df,  how='left', left_on=['name','user'], 
                  right_on = ['name','user'], suffixes=('','_03'))

In [43]:
chefdf.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 486613 entries, 0 to 486612
Data columns (total 36 columns):
avg_rating           486613 non-null float64
date                 486613 non-null datetime64[ns]
day                  486613 non-null float64
month                486613 non-null float64
name                 486613 non-null object
number_ratings       486613 non-null int64
printed              486613 non-null int64
saved                486613 non-null int64
shared               486613 non-null int64
user                 486613 non-null object
year                 486613 non-null float64
avg_rating_11        486462 non-null float64
number_ratings_11    486462 non-null float64
printed_11           486462 non-null float64
saved_11             486462 non-null float64
shared_11            486462 non-null float64
avg_rating_12        486344 non-null float64
number_ratings_12    486344 non-null float64
printed_12           486344 non-null float64
saved_12             486344 non-null 

In [44]:
print(len(chefdf[['name','user']]))

486613


## Duplicate entries

As the output above shows, the DataFrame has grown from 217,258 to 486,613 rows.
A left join only gains rows when a join key matches more than one row, so the combination of name and user is evidently not unique in every snapshot. This is most probably due to small changes in the datasets over time, e.g. users changing their names or recipe titles at a later time.
The attributes chosen for joining the datasets (name and user) are therefore not optimal, but unfortunately there is no other combination that would provide better results.
These redundant rows need to be found and removed to obtain a clean dataset.
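How repeated keys inflate a left join can be reproduced on a tiny example. This is a sketch on invented data, not a diagnosis of the actual chefkoch files; it also shows the `validate` argument of `pd.merge`, which makes pandas raise a `MergeError` instead of silently multiplying rows.

```python
import pandas as pd

# Toy frames (hypothetical): the same key occurs twice on each side
left = pd.DataFrame({'name': ['X', 'X'], 'user': ['u', 'u'], 'saved': [1, 1]})
right = pd.DataFrame({'name': ['X', 'X'], 'user': ['u', 'u'], 'saved': [2, 2]})

merged = pd.merge(left, right, how='left', on=['name', 'user'],
                  suffixes=('', '_11'))
print(len(left), '->', len(merged))  # every left row matches both right rows

# With validate='one_to_one', duplicated keys raise an error instead
try:
    pd.merge(left, right, how='left', on=['name', 'user'],
             suffixes=('', '_11'), validate='one_to_one')
    raised = False
except pd.errors.MergeError:
    raised = True
print(raised)
```

Because the effect compounds with every additional join, even a small share of repeated keys can lead to a large increase after five merges.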

In [45]:
chefdf = chefdf.drop_duplicates(subset=['name', 'user'])
len(chefdf.name)

216585

In [46]:
# Some rows have missing values again, so we drop them as well:
chefdf = chefdf.dropna()

In [47]:
# Dropping rows with missing values removes only a small number of recipes:
len(chefdf.name)

215977

In [24]:
# Export CSV
chefdf.to_csv('exports/trend_data_chefkoch.csv')

In [51]:
# Average: avg_rating
print("\n avg_rating:")
print(chefdf['avg_rating'].mean())
print(chefdf['avg_rating_11'].mean())
print(chefdf['avg_rating_12'].mean())
print(chefdf['avg_rating_01'].mean())
print(chefdf['avg_rating_02'].mean())
print(chefdf['avg_rating_03'].mean())

# Aggregation: number_ratings
print("\n number_ratings:")
print(chefdf['number_ratings'].sum())
print(chefdf['number_ratings_11'].sum())
print(chefdf['number_ratings_12'].sum())
print(chefdf['number_ratings_01'].sum())
print(chefdf['number_ratings_02'].sum())
print(chefdf['number_ratings_03'].sum())

# Aggregation: printed
print("\n printed:")
print(chefdf['printed'].sum())
print(chefdf['printed_11'].sum())
print(chefdf['printed_12'].sum())
print(chefdf['printed_01'].sum())
print(chefdf['printed_02'].sum())
print(chefdf['printed_03'].sum())

# Aggregation: saved
print("\n saved:")
print(chefdf['saved'].sum())
print(chefdf['saved_11'].sum())
print(chefdf['saved_12'].sum())
print(chefdf['saved_01'].sum())
print(chefdf['saved_02'].sum())
print(chefdf['saved_03'].sum())

# Aggregation: shared
print("\n shared:")
print(chefdf['shared'].sum())
print(chefdf['shared_11'].sum())
print(chefdf['shared_12'].sum())
print(chefdf['shared_01'].sum())
print(chefdf['shared_02'].sum())
print(chefdf['shared_03'].sum())


 avg_rating:
3.48150465096
3.48569551387
3.48866416331
3.49199211027
3.49526861657
3.49759108609

 number_ratings:
2407254
2437394.0
2459617.0
2483845.0
2508646.0
2526337.0

 printed:
550408646
557385286.0
563838808.0
571425561.0
577030421.0
581359847.0

 saved:
76167205
76705867.0
77065700.0
77442128.0
77850507.0
78161592.0

 shared:
5501673
5511549.0
5512105.0
5510087.0
5513907.0
5511075.0


Now only the recipes that are present in every snapshot of the data remain. Dropping the duplicates and the recipes with missing values has reduced the merged DataFrame from 486,613 rows to 215,977 recipes and has given us a consistent set that is suitable for trend analysis.
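Why dropping missing values leaves only recipes that exist in every snapshot can be seen in a small sketch. The data below is invented for illustration: a recipe that is absent from a later snapshot gets NaN in the suffixed columns after the left join and is then removed by `dropna`.

```python
import pandas as pd

# Toy snapshots (hypothetical); recipe 'B' is missing from the later one
base = pd.DataFrame({'name': ['A', 'B'], 'user': ['u1', 'u2'], 'saved': [10, 20]})
later = pd.DataFrame({'name': ['A'], 'user': ['u1'], 'saved': [12]})

merged = pd.merge(base, later, how='left', on=['name', 'user'],
                  suffixes=('', '_11'))

# Unmatched recipes show up as NaN in the suffixed column
print(merged['saved_11'].isna().sum())

# dropna keeps only rows that have a value in every snapshot
complete = merged.dropna()
print(len(complete))
```

The NaN placeholders are also the reason why the suffixed count columns are stored as floats rather than integers after the merge.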

### Calculating the differences

In addition to the dataset above, which shows the numerical values for each timestep, we can also create a dataset that holds the deltas (i.e. the differences between consecutive timesteps). The following abbreviations are used:

- avg_rating = ar
- number_ratings = nr
- printed = p
- saved = sa
- shared = sh
- delta(Oct-Nov) = d1
- delta(Nov-Dec) = d2
- delta(Dec-Jan) = d3
- delta(Jan-Feb) = d4
- delta(Feb-Mar) = d5
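The delta columns follow a regular pattern, so they could also be generated in a loop. The following is a minimal sketch on an invented toy frame with three snapshots; it reuses the `sa_d1`-style column names of this notebook, while the variable names `suffixes` and `deltas` are assumptions made for the example.

```python
import pandas as pd

# Toy frame (hypothetical values) using the notebook's column naming scheme
df = pd.DataFrame({
    'saved': [10, 20],
    'saved_11': [12, 21],
    'saved_12': [15, 21],
})

suffixes = ['', '_11', '_12']  # snapshots in chronological order
deltas = pd.DataFrame(index=df.index)

# Pair each snapshot with its successor and subtract the earlier value
for step, (prev, curr) in enumerate(zip(suffixes, suffixes[1:]), start=1):
    deltas['sa_d%d' % step] = df['saved' + curr] - df['saved' + prev]

print(deltas)
```

An outer loop over the attribute names and their abbreviations would cover all five attributes in the same way.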

In [53]:
deltadf = pd.DataFrame(chefdf[['name', 'user']])

# average_rating
deltadf['ar_d1'] = chefdf['avg_rating_11']-chefdf['avg_rating']
deltadf['ar_d2'] = chefdf['avg_rating_12']-chefdf['avg_rating_11']
deltadf['ar_d3'] = chefdf['avg_rating_01']-chefdf['avg_rating_12']
deltadf['ar_d4'] = chefdf['avg_rating_02']-chefdf['avg_rating_01']
deltadf['ar_d5'] = chefdf['avg_rating_03']-chefdf['avg_rating_02']

# number_ratings
deltadf['nr_d1'] = chefdf['number_ratings_11']-chefdf['number_ratings']
deltadf['nr_d2'] = chefdf['number_ratings_12']-chefdf['number_ratings_11']
deltadf['nr_d3'] = chefdf['number_ratings_01']-chefdf['number_ratings_12']
deltadf['nr_d4'] = chefdf['number_ratings_02']-chefdf['number_ratings_01']
deltadf['nr_d5'] = chefdf['number_ratings_03']-chefdf['number_ratings_02']

# printed
deltadf['p_d1'] = chefdf['printed_11']-chefdf['printed']
deltadf['p_d2'] = chefdf['printed_12']-chefdf['printed_11']
deltadf['p_d3'] = chefdf['printed_01']-chefdf['printed_12']
deltadf['p_d4'] = chefdf['printed_02']-chefdf['printed_01']
deltadf['p_d5'] = chefdf['printed_03']-chefdf['printed_02']

# saved
deltadf['sa_d1'] = chefdf['saved_11']-chefdf['saved']
deltadf['sa_d2'] = chefdf['saved_12']-chefdf['saved_11']
deltadf['sa_d3'] = chefdf['saved_01']-chefdf['saved_12']
deltadf['sa_d4'] = chefdf['saved_02']-chefdf['saved_01']
deltadf['sa_d5'] = chefdf['saved_03']-chefdf['saved_02']

# shared
deltadf['sh_d1'] = chefdf['shared_11']-chefdf['shared']
deltadf['sh_d2'] = chefdf['shared_12']-chefdf['shared_11']
deltadf['sh_d3'] = chefdf['shared_01']-chefdf['shared_12']
deltadf['sh_d4'] = chefdf['shared_02']-chefdf['shared_01']
deltadf['sh_d5'] = chefdf['shared_03']-chefdf['shared_02']

deltadf.head(10)

| | name | user | ar_d1 | ar_d2 | ar_d3 | ar_d4 | ar_d5 | nr_d1 | nr_d2 | nr_d3 | ... | sa_d1 | sa_d2 | sa_d3 | sa_d4 | sa_d5 | sh_d1 | sh_d2 | sh_d3 | sh_d4 | sh_d5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Rhabarber-Streusel-Kuchen | mickyjenny | 0.0 | 0.0 | -0.01 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 14.0 | 7.0 | 1.0 | 12.0 | 11.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | Holunder - Balsamico - Essig | rosemarywitch | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 11.0 | 12.0 | 9.0 | 7.0 | 12.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | Apfelkuchen mit Streuseln | mimamutti | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 30.0 | 12.0 | 10.0 | ... | 242.0 | 123.0 | 89.0 | 100.0 | 112.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | Salzige Dampfnudeln | t_segler | -0.02 | 0.0 | 0.01 | 0.0 | 0.01 | 6.0 | 5.0 | 6.0 | ... | 28.0 | 24.0 | 27.0 | 25.0 | 26.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | Der beste Pizzateig | Jehuty | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 35.0 | 19.0 | 21.0 | ... | 283.0 | 159.0 | 311.0 | 341.0 | 267.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | Grießbrei von Großmutter | Jona13 | 0.01 | 0.0 | 0.0 | 0.0 | 0.0 | 38.0 | 16.0 | 19.0 | ... | 183.0 | 82.0 | 87.0 | 272.0 | 149.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | Friedas genialer Hefezopf | lone_bohne | 0.01 | 0.0 | 0.0 | 0.0 | 0.0 | 32.0 | 16.0 | 13.0 | ... | 121.0 | 89.0 | 121.0 | 191.0 | 180.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | Gourmet-Schoko-Pudding selbstgemacht, sahnig u... | Stift1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 | 3.0 | 3.0 | 5.0 | ... | 53.0 | 34.0 | 100.0 | 66.0 | 52.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | Hamburger und Hot Dog Buns | Küchenpunk | -0.02 | 0.01 | -0.01 | 0.01 | -0.01 | 3.0 | 3.0 | 5.0 | ... | 36.0 | 46.0 | 54.0 | 73.0 | 51.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | Gänsekeulen aus dem Bratschlauch | Arsenase1 | 0.0 | 0.0 | -0.04 | 0.0 | 0.01 | 0.0 | 10.0 | 22.0 | ... | 61.0 | 148.0 | 172.0 | 25.0 | 21.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [54]:
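# Remove special characters and redundant whitespace from the recipe names.
# The cleaned strings are written back; otherwise the loop would have no effect.
special_chars = ['"','-','=','!','(',')','.','♥',',',':','~','„','“','/','–','&','+',';','*','☆']
cleaned_names = []
for words in chefdf['name']:
    for char in special_chars:
        words = words.replace(char, ' ')
    cleaned_names.append(' '.join(words.split()))  # collapse runs of whitespace and trim
chefdf['name'] = cleaned_names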

In [55]:
# Export CSV
deltadf.to_csv('exports/trend_deltas_chefkoch.csv')

### Recalculation of General Statistics (Exploratory Analysis)

Considering the irregularities in the datasets, the total number of recipes on this website needs to be recalculated after removing duplicates:

In [57]:
countdf = pd.DataFrame(chef03)
countdf = countdf.drop_duplicates(subset=['name', 'user'])
countdf.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 318291 entries, 0 to 319576
Data columns (total 18 columns):
avg_rating        222833 non-null object
calories          318291 non-null object
date              222833 non-null object
difficulty        318291 non-null object
ingredients       318291 non-null object
name              318291 non-null object
number_ratings    222833 non-null object
preparation       318291 non-null object
printed           222833 non-null float64
saved             222833 non-null object
shared            222833 non-null object
subtitle          163773 non-null object
time_cook         318291 non-null object
time_rest         318291 non-null object
time_work         318291 non-null object
user              318291 non-null object
user_activity     318291 non-null object
user_date         318291 non-null object
dtypes: float64(1), object(17)
memory usage: 46.1+ MB


In [58]:
count1df = pd.DataFrame(chef02)
count1df = count1df.drop_duplicates(subset=['name', 'user'])
count1df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 316953 entries, 0 to 318239
Data columns (total 18 columns):
avg_rating        221709 non-null object
calories          316953 non-null object
date              221709 non-null object
difficulty        316953 non-null object
ingredients       316953 non-null object
name              316953 non-null object
number_ratings    221709 non-null object
preparation       316953 non-null object
printed           221709 non-null float64
saved             221709 non-null object
shared            221709 non-null object
subtitle          162879 non-null object
time_cook         316953 non-null object
time_rest         316953 non-null object
time_work         316953 non-null object
user              316953 non-null object
user_activity     316953 non-null object
user_date         316953 non-null object
dtypes: float64(1), object(17)
memory usage: 45.9+ MB


In [59]:
count6df = pd.DataFrame(chef10)
count6df = count6df.drop_duplicates(subset=['name', 'user'])
count6df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 312488 entries, 0 to 313750
Data columns (total 21 columns):
avg_rating        216421 non-null object
calories          312488 non-null object
date              216421 non-null object
day               216421 non-null float64
difficulty        312488 non-null object
ingredients       312488 non-null object
month             216421 non-null float64
name              312488 non-null object
number_ratings    312488 non-null object
preparation       312488 non-null object
printed           312488 non-null object
saved             312488 non-null object
shared            312488 non-null object
subtitle          160006 non-null object
time_cook         312488 non-null object
time_rest         312488 non-null object
time_work         312488 non-null object
user              312488 non-null object
user_activity     312488 non-null object
user_date         312488 non-null object
year              216421 non-null float64
dtypes: float64(3), objec

In [60]:
# Recalculating the deltas for general statistics (exploratory analysis)
# Recipe count
print(len(countdf['name']))
print(len(countdf['name'])-len(count1df['name']))
print(len(countdf['name'])-len(count6df['name']))

318291
1338
5803


In [61]:
# Distinct authors
print(len(countdf['user'].unique()))
print(len(countdf['user'].unique())-len(count1df['user'].unique()))
print(len(countdf['user'].unique())-len(count6df['user'].unique()))

80459
390
1699


In [63]:
# Number of ratings
print(pd.to_numeric(countdf['number_ratings']).sum())
print(pd.to_numeric(countdf['number_ratings']).sum()-pd.to_numeric(count1df['number_ratings']).sum())
print(pd.to_numeric(countdf['number_ratings']).sum()-pd.to_numeric(count6df['number_ratings']).sum())

2553803.0
19591.0
130573.0


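As we can see, removing the duplicates changes the numbers only slightly. For the October snapshot, for example, the recipe count drops from 313,751 to 312,488 (about 0.4%), which is negligible.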