# Dummy data creation representing Q4 2020
<hr>
<b>Warning:</b><br>
When this file is run, the data exported in the file <u>Q4-all.csv</u> will be updated.<br>
This <u>might not be the same</u> as what is shown in the Technical Documentation.
<hr>

In [1]:
# Suppress warnings for presentation of the notebook. During development, the warnings were not suppressed.
import warnings
warnings.filterwarnings('ignore')

First, create a pandas DataFrame with a column called userID. Then fill the userID column with 100,000 random integers between 0 and 200, inclusive. This represents 100,000 user activities spread across 201 users.

In [2]:
import numpy as np
import pandas as pd

df = pd.DataFrame(columns=['userID'])

In [3]:
from random import seed
from random import randint

seed(1)

userID = []

for _ in range(100000):
    user = randint(0, 200)
    userID.append(user)

df['userID'] = userID
df['userID'].value_counts()

116    558
185    551
64     551
131    551
68     545
      ... 
85     464
196    462
132    458
182    444
74     433
Name: userID, Length: 201, dtype: int64
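
As a rough sanity check on the counts above, the sketch below compares the observed per-user counts with what uniform sampling predicts. It reuses `seed(1)` and `randint(0, 200)` from the cell above; the names `n_rows`, `n_users`, `expected`, and `std` are illustrative and not part of the original notebook.

```python
from collections import Counter
from math import sqrt
from random import randint, seed

n_rows = 100_000   # number of activity rows, as in the notebook
n_users = 201      # randint(0, 200) is inclusive on both ends

seed(1)
counts = Counter(randint(0, 200) for _ in range(n_rows))

# Under uniform sampling each user's count is Binomial(n_rows, 1 / n_users).
p = 1 / n_users
expected = n_rows * p                # about 497.5 rows per user
std = sqrt(n_rows * p * (1 - p))     # about 22.2 rows

print(f"expected per user: {expected:.1f} +/- {std:.1f}")
print(f"observed min/max: {min(counts.values())} / {max(counts.values())}")
```

The counts shown above (433 to 558) lie within roughly three standard deviations of the expected value, which is consistent with uniform sampling.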

The next step is to create the companyID variable. Using the same method, each of the 201 unique userID values is assigned a random integer from 0 to 50 as its companyID. This gives the impression of several users per company.

In [4]:
# Count rows per user. rename_axis/reset_index(name=...) yields the columns
# 'userID' and 'count' on both older and newer pandas versions.
df_getCompanyID = df['userID'].value_counts().rename_axis('userID').reset_index(name='count')
companyID = []

for _ in range(201):
    company = randint(0, 50)
    companyID.append(company)

df_getCompanyID['companyID'] = companyID
df = df.merge(df_getCompanyID, on='userID', how='left')
df = df.drop(['count'], axis=1)
df.head()

Unnamed: 0,userID,companyID
0,34,5
1,145,44
2,195,8
3,16,7
4,65,42
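
The merge above can also be expressed as a lookup table. The following is a minimal sketch on a small toy frame, assuming one company per user; `toy` and `user_to_company` are hypothetical names, not from the notebook, and the sketch does not reproduce the notebook's random values.

```python
import pandas as pd
from random import randint, seed

seed(1)

# Toy stand-in for the activity table: several rows per user.
toy = pd.DataFrame({'userID': [3, 1, 3, 2, 1, 3]})

# Draw one company per unique user, then look it up for every row.
user_to_company = {user: randint(0, 50) for user in toy['userID'].unique()}
toy['companyID'] = toy['userID'].map(user_to_company)

# Each user should belong to exactly one company.
companies_per_user = toy.groupby('userID')['companyID'].nunique()
print(companies_per_user.max())
```

Because the mapping is keyed by userID, it does not depend on the row order of `value_counts()`.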


Again, the same method was used to assign a country to each user. Each unique userID was assigned either 1 or 2, which was then renamed to Indonesia or Turkey.

In [5]:
# Count rows per user. rename_axis/reset_index(name=...) yields the columns
# 'userID' and 'count' on both older and newer pandas versions.
df_getCountry = df['userID'].value_counts().rename_axis('userID').reset_index(name='count')

countries = []

for _ in range(201):
    country = randint(1, 2)
    countries.append(country)

df_getCountry['country'] = countries
df = df.merge(df_getCountry, on='userID', how='left')
df = df.drop(['count'], axis=1)
df['country'] = df['country'].replace(1, 'Indonesia')
df['country'] = df['country'].replace(2, 'Turkey')
df.head()

Unnamed: 0,userID,companyID,country
0,34,5,Indonesia
1,145,44,Turkey
2,195,8,Turkey
3,16,7,Turkey
4,65,42,Turkey


Although not very different from the other columns, creating the timestamp was slightly more involved. The function assigns a random date and time within Q4 2020 to each of the 100,000 user activity rows.

In [6]:
from random import randrange
from datetime import timedelta
from datetime import datetime

def random_date(start, end):
    """Return a random datetime in [start, end), at one-second resolution."""
    delta = end - start
    int_delta = (delta.days * 24 * 60 * 60) + delta.seconds
    random_second = randrange(int_delta)
    return start + timedelta(seconds=random_second)

# Bounds of Q4 2020. The end bound is exclusive, so 23:59:00 onward
# on 31 December is never drawn.
d1 = datetime.strptime('10/1/2020 00:00', '%m/%d/%Y %H:%M')
d2 = datetime.strptime('12/31/2020 23:59', '%m/%d/%Y %H:%M')

timestamp = []

for _ in range(100000):
    timestamp.append(random_date(d1, d2))
        
df['timestamp'] = timestamp
df.head()

Unnamed: 0,userID,companyID,country,timestamp
0,34,5,Indonesia,2020-10-04 23:02:44
1,145,44,Turkey,2020-10-22 19:26:33
2,195,8,Turkey,2020-12-31 09:02:33
3,16,7,Turkey,2020-11-17 16:32:24
4,65,42,Turkey,2020-12-02 01:32:00
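
The same idea, a random offset in seconds added to the start of the quarter, can be sketched in vectorized form. This is an alternative under assumptions, not the notebook's method: it uses NumPy's `default_rng` (so it will not reproduce the values above), and it takes midnight on 1 January 2021 as an exclusive end so that the whole of 31 December is covered.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # seed chosen arbitrarily for the sketch

start = pd.Timestamp('2020-10-01')
end = pd.Timestamp('2021-01-01')            # exclusive upper bound
total_seconds = int((end - start).total_seconds())

# One random whole-second offset per row, in [0, total_seconds).
offsets = rng.integers(0, total_seconds, size=100_000)
timestamps = start + pd.to_timedelta(offsets, unit='s')

print(timestamps.min(), timestamps.max())
```

Q4 2020 spans 92 days, so `total_seconds` is 92 * 86,400 = 7,948,800.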


Further columns were extracted from the timestamp: year, quarter, month, week number, and day of the month.

In [7]:
# The .dt accessor exposes the datetime components of the timestamp column.
df['year'] = df['timestamp'].dt.year
df['quarter'] = df['timestamp'].dt.quarter
df['month'] = df['timestamp'].dt.month
# DatetimeIndex.week is deprecated since pandas 1.1; isocalendar().week gives the ISO week number.
df['weekNumber'] = df['timestamp'].dt.isocalendar().week
df['dayNumber'] = df['timestamp'].dt.day
df.head()

Unnamed: 0,userID,companyID,country,timestamp,year,quarter,month,weekNumber,dayNumber
0,34,5,Indonesia,2020-10-04 23:02:44,2020,4,10,40,4
1,145,44,Turkey,2020-10-22 19:26:33,2020,4,10,43,22
2,195,8,Turkey,2020-12-31 09:02:33,2020,4,12,53,31
3,16,7,Turkey,2020-11-17 16:32:24,2020,4,11,47,17
4,65,42,Turkey,2020-12-02 01:32:00,2020,4,12,49,2
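
The week numbers above match the ISO 8601 calendar, in which 2020 has 53 weeks. A short check with the standard library, using two timestamps taken from the table above:

```python
from datetime import datetime

for text in ('2020-10-04 23:02:44', '2020-12-31 09:02:33'):
    ts = datetime.strptime(text, '%Y-%m-%d %H:%M:%S')
    iso_year, iso_week, iso_weekday = ts.isocalendar()
    print(text, '-> ISO week', iso_week)
```

This explains why 31 December 2020 is reported as week 53 rather than week 52.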


Next, Page and Event activities were assigned at random, using a method similar to the one used for the country values.

In [8]:
# The upper bound of np.random.randint is exclusive, so this draws 1 or 2 for every row.
df['activity'] = np.random.randint(1, 3, df.shape[0])
df['activity'] = df['activity'].replace(1, 'Page')
df['activity'] = df['activity'].replace(2, 'Event')
df.head()

Unnamed: 0,userID,companyID,country,timestamp,year,quarter,month,weekNumber,dayNumber,activity
0,34,5,Indonesia,2020-10-04 23:02:44,2020,4,10,40,4,Page
1,145,44,Turkey,2020-10-22 19:26:33,2020,4,10,43,22,Page
2,195,8,Turkey,2020-12-31 09:02:33,2020,4,12,53,31,Event
3,16,7,Turkey,2020-11-17 16:32:24,2020,4,11,47,17,Page
4,65,42,Turkey,2020-12-02 01:32:00,2020,4,12,49,2,Event
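
Note that `seed(1)` from the standard `random` module does not seed NumPy, so the `activity` column changes between runs. If reproducibility is wanted, one option is a seeded NumPy generator, sketched below. The seed value and the names `rng`, `activity`, and `activity_again` are assumptions for illustration, and the labels are drawn directly rather than recoded from 1 and 2.

```python
import numpy as np

n_rows = 100_000

# A generator with a fixed seed yields the same draws on every run.
rng = np.random.default_rng(1)
activity = rng.choice(['Page', 'Event'], size=n_rows)

# Re-creating the generator with the same seed reproduces the column.
activity_again = np.random.default_rng(1).choice(['Page', 'Event'], size=n_rows)
print((activity == activity_again).all())
```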


Lastly, the webpage URLs were assigned not from random integers but by drawing at random from a predefined list of links.

In [9]:
import random

webPages_ori = [
    'https://en.wikipedia.org',
    'https://en.wikipedia.org/wiki/Main_Page',
    'https://en.wikipedia.org/wiki/Financial_services',
    'https://en.wikipedia.org/wiki/Financial_technology',
    'https://en.wikipedia.org/wiki/Accounting',
    'https://en.wikipedia.org/wiki/Bookkeeping',
]

webPages_random = []

for _ in range(100000):
    page = random.choice(webPages_ori)
    webPages_random.append(page)
    
df['page'] = webPages_random
df.head()

Unnamed: 0,userID,companyID,country,timestamp,year,quarter,month,weekNumber,dayNumber,activity,page
0,34,5,Indonesia,2020-10-04 23:02:44,2020,4,10,40,4,Page,https://en.wikipedia.org/wiki/Financial_services
1,145,44,Turkey,2020-10-22 19:26:33,2020,4,10,43,22,Page,https://en.wikipedia.org/wiki/Financial_techno...
2,195,8,Turkey,2020-12-31 09:02:33,2020,4,12,53,31,Event,https://en.wikipedia.org/wiki/Bookkeeping
3,16,7,Turkey,2020-11-17 16:32:24,2020,4,11,47,17,Page,https://en.wikipedia.org/wiki/Main_Page
4,65,42,Turkey,2020-12-02 01:32:00,2020,4,12,49,2,Event,https://en.wikipedia.org/wiki/Financial_services
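
The loop above can also be written with `random.choices`, which draws `k` items with replacement in a single call. A minimal sketch, using a shortened URL list and the hypothetical names `pages` and `sampled`; it is not guaranteed to reproduce the same sequence as the loop:

```python
import random

random.seed(1)

pages = [
    'https://en.wikipedia.org',
    'https://en.wikipedia.org/wiki/Accounting',
    'https://en.wikipedia.org/wiki/Bookkeeping',
]

# Draw 100,000 pages uniformly at random, with replacement.
sampled = random.choices(pages, k=100_000)

print(len(sampled), set(sampled) <= set(pages))
```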


The dataset was then ready to be written to CSV for further use.

In [10]:
df.to_csv('Q4-all.csv')
df.to_csv('../iteration-1/algorithm/Q4-all.csv')
df.to_csv('../iteration-2/algorithm/Q4-all.csv')
df.to_csv('../iteration-3/algorithm/Q4-all.csv')
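
By default `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again. Whether that is wanted depends on the downstream code; the sketch below only illustrates the difference on a toy frame written to a temporary directory (the file names are hypothetical).

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({'userID': [34, 145], 'companyID': [5, 44]})

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, 'with_index.csv')
    without_index = os.path.join(tmp, 'without_index.csv')

    toy.to_csv(with_index)                    # default: index is written
    toy.to_csv(without_index, index=False)    # index is omitted

    cols_with = list(pd.read_csv(with_index).columns)
    cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)
print(cols_without)
```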