# Dummy data creation representing Q1 2021 (iteration 2)
<hr>
<b>Warning:</b><br>
When this file is run, the data exported to the file <u>iteration-2-Q1.csv</u> will be overwritten with newly generated random values.<br>
The result <u>might not be the same</u> as what is shown in the Technical Documentation.
<hr>
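
Because every run draws fresh random values, one way to make the export repeatable is to seed the random generators before any sampling. The sketch below is not part of the original notebook: the seed value 42 and the helper name `seed_everything` are hypothetical, and it assumes that `DataFrame.sample` falls back to NumPy's global random state when no `random_state` is passed.

```python
import random
import numpy as np

SEED = 42  # hypothetical value; any fixed integer works

def seed_everything(seed):
    # Seed both generators used in this notebook: the stdlib `random`
    # module and NumPy's global random state.
    random.seed(seed)
    np.random.seed(seed)

seed_everything(SEED)
first = (random.randrange(1000), int(np.random.randint(1, 3)))

seed_everything(SEED)
second = (random.randrange(1000), int(np.random.randint(1, 3)))

# Re-seeding reproduces the same draws.
print(first == second)
```

Calling such a helper once at the top of the notebook would be enough; the draws in the later cells would then follow the same sequence on every run.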

In [1]:
# Suppress warnings for presentation purposes only; they were left enabled during development.
import warnings
warnings.filterwarnings('ignore')

First, import the NumPy and pandas libraries, then read the Q4 dummy dataset. At this stage the Q4 dataset is reduced to userID and the attributes that depend on it, companyID and country, and duplicate rows are dropped so that only this subset remains.

In [2]:
import numpy as np
import pandas as pd

In [3]:
df = pd.read_csv('Q4-all.csv')
df = df[['userID', 'companyID', 'country']]
df = df.drop_duplicates()
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 201 entries, 0 to 1035
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   userID     201 non-null    int64 
 1   companyID  201 non-null    int64 
 2   country    201 non-null    object
dtypes: int64(2), object(1)
memory usage: 6.3+ KB


Now every row represents a single user. In the code below, one row at a time is sampled at random from this subset and appended to a new list until it holds the requested 101,101 rows, so the same user can appear many times. The number 101,101 was chosen arbitrarily so that the amount of activity differs from the Q4 dataset. Each sampled row is first stored as a single list in one column named values; line 10 of the cell then expands those lists into the individual userID, companyID, and country columns.

In [4]:
df1 = []

for x in range(101101):
    row = df.sample()
    row = row.values.tolist()
    df1.append(row)

df1 = pd.DataFrame(df1)
df1.rename(columns={0: 'values',}, inplace=True)
df1[['userID', 'companyID', 'country']] = pd.DataFrame(df1["values"].to_list(), columns=['userID', 'companyID', 'country'])
df = df1
df1.head()

Unnamed: 0,values,userID,companyID,country
0,"[32, 37, Turkey]",32,37,Turkey
1,"[140, 41, Indonesia]",140,41,Indonesia
2,"[195, 8, Turkey]",195,8,Turkey
3,"[19, 23, Turkey]",19,23,Turkey
4,"[84, 17, Indonesia]",84,17,Indonesia
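
The row-by-row loop can also be expressed as a single sampling call. The following is a minimal sketch rather than the notebook's own method: it assumes a small stand-in DataFrame named `users` (three rows borrowed from the output above) in place of the Q4 subset, and it uses `DataFrame.sample` with `replace=True` so that the same user can be drawn many times. It also skips the intermediate values column.

```python
import pandas as pd

# Hypothetical stand-in for the de-duplicated Q4 subset.
users = pd.DataFrame({
    'userID': [32, 140, 195],
    'companyID': [37, 41, 8],
    'country': ['Turkey', 'Indonesia', 'Turkey'],
})

n_rows = 101101  # same row count as requested in the notebook

# Draw n_rows users with replacement and renumber the index from 0.
sampled = users.sample(n=n_rows, replace=True).reset_index(drop=True)

print(sampled.shape)
```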


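In the cell below, the timestamp variable is added back using the random and datetime libraries. Each row receives a random moment, at one-second resolution, between 1 January 2021 00:00 and 31 March 2021 23:59.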

In [5]:
from random import randrange
from datetime import timedelta
from datetime import datetime

def random_date(start, end):
    # Return a random datetime in [start, end), at one-second resolution.
    delta = end - start
    int_delta = (delta.days * 24 * 60 * 60) + delta.seconds
    random_second = randrange(int_delta)
    return start + timedelta(seconds=random_second)

# Bounds of Q1 2021.
d1 = datetime.strptime('1/1/2021 00:00', '%m/%d/%Y %H:%M')
d2 = datetime.strptime('3/31/2021 23:59', '%m/%d/%Y %H:%M')

# One random timestamp per row.
timestamp = [random_date(d1, d2) for _ in range(101101)]

df['timestamp'] = timestamp
df.head()

Unnamed: 0,values,userID,companyID,country,timestamp
0,"[32, 37, Turkey]",32,37,Turkey,2021-02-05 08:17:01
1,"[140, 41, Indonesia]",140,41,Indonesia,2021-01-05 22:56:21
2,"[195, 8, Turkey]",195,8,Turkey,2021-02-11 12:56:44
3,"[19, 23, Turkey]",19,23,Turkey,2021-03-17 19:24:07
4,"[84, 17, Indonesia]",84,17,Indonesia,2021-01-05 13:55:09
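
The same kind of timestamps can be drawn in one vectorized step. This sketch is an alternative under stated assumptions, not the notebook's method: it relies on NumPy's `default_rng` and `pd.to_timedelta`, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

start = pd.Timestamp('2021-01-01 00:00')
end = pd.Timestamp('2021-03-31 23:59')
n_rows = 101101

# Number of whole seconds between the two bounds.
span_seconds = int((end - start).total_seconds())

# Draw one random offset (in seconds) per row, in [0, span_seconds).
rng = np.random.default_rng()
offsets = rng.integers(0, span_seconds, size=n_rows)

# Add the offsets to the start of the quarter.
timestamps = start + pd.to_timedelta(offsets, unit='s')

print(timestamps.min(), timestamps.max())
```

The result is a `DatetimeIndex` that can be assigned to a DataFrame column of the same length.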


Further columns are then derived from the timestamp: year, quarter, month, ISO week number, and day of the month.

In [6]:
df['year'] = pd.DatetimeIndex(df['timestamp']).year
df['quarter'] = pd.DatetimeIndex(df['timestamp']).quarter
df['month'] = pd.DatetimeIndex(df['timestamp']).month
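# DatetimeIndex.week was removed in pandas 2.0; .dt.isocalendar().week yields the same ISO week number.
df['weekNumber'] = df['timestamp'].dt.isocalendar().week.astype(int)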
df['dayNumber'] = pd.DatetimeIndex(df['timestamp']).day
df.head()

Unnamed: 0,values,userID,companyID,country,timestamp,year,quarter,month,weekNumber,dayNumber
0,"[32, 37, Turkey]",32,37,Turkey,2021-02-05 08:17:01,2021,1,2,5,5
1,"[140, 41, Indonesia]",140,41,Indonesia,2021-01-05 22:56:21,2021,1,1,1,5
2,"[195, 8, Turkey]",195,8,Turkey,2021-02-11 12:56:44,2021,1,2,6,11
3,"[19, 23, Turkey]",19,23,Turkey,2021-03-17 19:24:07,2021,1,3,11,17
4,"[84, 17, Indonesia]",84,17,Indonesia,2021-01-05 13:55:09,2021,1,1,1,5
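
One detail worth knowing about weekNumber: the week numbers follow the ISO 8601 calendar, in which 1 to 3 January 2021 belong to week 53 of 2020. Rows dated on those days therefore carry weekNumber 53 even though year is 2021. The sketch below illustrates this with the standard library only; the dates are chosen for illustration.

```python
from datetime import date

# 1 January 2021 was a Friday, so ISO 8601 assigns it to the last week of 2020.
iso_year, iso_week, iso_weekday = date(2021, 1, 1).isocalendar()
print(iso_year, iso_week, iso_weekday)  # 2020 53 5

# The first ISO week of 2021 starts on Monday, 4 January.
print(date(2021, 1, 4).isocalendar()[1])  # 1
```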


Again using random assignment, this time through NumPy's random module, the activity column is filled with either Page or Event for each row.

In [7]:
df['activity'] = np.random.randint(1,3,df.shape[0])
df['activity'] = df['activity'].replace(1, 'Page')
df['activity'] = df['activity'].replace(2, 'Event')
df.head()

Unnamed: 0,values,userID,companyID,country,timestamp,year,quarter,month,weekNumber,dayNumber,activity
0,"[32, 37, Turkey]",32,37,Turkey,2021-02-05 08:17:01,2021,1,2,5,5,Event
1,"[140, 41, Indonesia]",140,41,Indonesia,2021-01-05 22:56:21,2021,1,1,1,5,Event
2,"[195, 8, Turkey]",195,8,Turkey,2021-02-11 12:56:44,2021,1,2,6,11,Page
3,"[19, 23, Turkey]",19,23,Turkey,2021-03-17 19:24:07,2021,1,3,11,17,Event
4,"[84, 17, Indonesia]",84,17,Indonesia,2021-01-05 13:55:09,2021,1,1,1,5,Event
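
The draw-then-replace steps can be collapsed into one call that picks the labels directly. A minimal sketch, assuming NumPy's `Generator.choice` and a shortened row count for illustration:

```python
import numpy as np

n_rows = 10  # shortened for illustration; the notebook uses df.shape[0]

# Pick 'Page' or 'Event' for every row with equal probability.
rng = np.random.default_rng()
activity = rng.choice(['Page', 'Event'], size=n_rows)

print(activity)
```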


Lastly, web page values are assigned at random in the same way. The set of pages is the same as in the Q4 dataset.

In [8]:
import random

webPages_ori = [
    'https://en.wikipedia.org',
    'https://en.wikipedia.org/wiki/Main_Page',
    'https://en.wikipedia.org/wiki/Financial_services',
    'https://en.wikipedia.org/wiki/Financial_technology',
    'https://en.wikipedia.org/wiki/Accounting',
    'https://en.wikipedia.org/wiki/Bookkeeping',
]

webPages_random = []

for x in range(101101):
    page = random.choice(webPages_ori)
    webPages_random.append(page)
    
df['page'] = webPages_random
df.head()

Unnamed: 0,values,userID,companyID,country,timestamp,year,quarter,month,weekNumber,dayNumber,activity,page
0,"[32, 37, Turkey]",32,37,Turkey,2021-02-05 08:17:01,2021,1,2,5,5,Event,https://en.wikipedia.org/wiki/Financial_services
1,"[140, 41, Indonesia]",140,41,Indonesia,2021-01-05 22:56:21,2021,1,1,1,5,Event,https://en.wikipedia.org/wiki/Financial_services
2,"[195, 8, Turkey]",195,8,Turkey,2021-02-11 12:56:44,2021,1,2,6,11,Page,https://en.wikipedia.org/wiki/Financial_services
3,"[19, 23, Turkey]",19,23,Turkey,2021-03-17 19:24:07,2021,1,3,11,17,Event,https://en.wikipedia.org/wiki/Financial_techno...
4,"[84, 17, Indonesia]",84,17,Indonesia,2021-01-05 13:55:09,2021,1,1,1,5,Event,https://en.wikipedia.org/wiki/Accounting
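
The page assignment can likewise be done without an explicit loop. This is a sketch under assumptions: it uses `random.choices` from the standard library, and `pages` is a shortened stand-in containing two of the URLs listed in the cell above.

```python
import random

# Two of the pages listed above, used here as a short stand-in list.
pages = [
    'https://en.wikipedia.org/wiki/Accounting',
    'https://en.wikipedia.org/wiki/Bookkeeping',
]

n_rows = 101101

# random.choices draws k items with replacement in a single call.
pages_random = random.choices(pages, k=n_rows)

print(len(pages_random))
```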


The dataset is then exported to CSV so that it can be used in the iteration 2 and iteration 3 folders.

In [9]:
df.to_csv('../iteration-2/algorithm/iteration-2-Q1.csv')
df.to_csv('../iteration-3/algorithm/iteration-2-Q1.csv')
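
Note that `to_csv` writes the row index as an extra unnamed column by default, and the helper column values is still part of the DataFrame at this point. If the downstream notebooks need neither (an assumption; check before changing the export), both can be left out as in this sketch, which uses a tiny stand-in DataFrame named `demo` and a temporary file rather than the real paths:

```python
import os
import tempfile
import pandas as pd

# Hypothetical stand-in with the same helper column as the notebook's df.
demo = pd.DataFrame({
    'values': [[32, 37, 'Turkey']],
    'userID': [32],
    'companyID': [37],
    'country': ['Turkey'],
})

# Write to a temporary location so the real export is untouched.
out_path = os.path.join(tempfile.mkdtemp(), 'demo.csv')

# Drop the helper column and omit the index column from the file.
demo.drop(columns=['values']).to_csv(out_path, index=False)

print(pd.read_csv(out_path).columns.tolist())
```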