# Creating a dummy dataset

While we wait for the _real_ data to be pulled, we can create a dummy database, or dummy datasets, based on what we already know about the schema.

In the data pull request we asked for 5 tables:
- clients
- organizations
- locations
- opportunities
- access_events

You can see the columns we requested for each table in the Google Sheet `Data schema/UCB data pull` in our shared Google Drive.

We will use a combination of pandas, NumPy, and Faker to create these.

In [16]:
import pandas as pd
import numpy as np
from faker import Faker
import datetime as dt

fake = Faker()

## clients

* id INT(11)
* is_admin TINYINT(1)
* created_at DATETIME
* updated_at DATETIME
* last_login DATETIME

This gives us an integer `id` column (in MySQL the 11 is a display width rather than a digit count; here we simply generate 11-digit IDs), a one-digit (1/0) column that serves as a boolean, and three datetimes. The datetimes are the most complicated part: `updated_at` must fall at or after `created_at` (never earlier), and `last_login` must also come **after** `created_at`.

In [28]:
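# dtype = np.int64 avoids an out-of-bounds error where NumPy's default integer is 32-bit
client_id = np.random.randint(low = 10_000_000_000, high = 99_999_999_999, size = 50, dtype = np.int64)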
client_id

array([53068371121, 55856732757, 53580875532, 41393188944, 61862458655,
       19965061455, 34081428847, 43382562265, 96677197701, 56419189895,
       20486189041, 92956547716, 12447474077, 57679041850, 27878828471,
       95281492829, 59001664775, 35512086529, 45117692169, 84003369606,
       28019716997, 90403410370, 24692282347, 66350551499, 86764307210,
       98628418097, 82607429373, 43168181756, 75505419126, 29679835518,
       42777655781, 97565632436, 80947904478, 13856973767, 17570065541,
       80812130250, 41183877799, 89301812608, 73291893139, 82179931791,
       63663002445, 40674822994, 73106357515, 66075792549, 68808331494,
       77186625461, 10062939849, 31866050603, 13732341874, 18412432313])
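
Independent random draws can, in principle, produce the same ID twice. If `id` is meant to act as a primary key (an assumption here, since the schema sheet is not reproduced in this notebook), sampling without replacement guarantees uniqueness. The following is a minimal sketch using only the standard library; `unique_ids` is a hypothetical helper name, not part of the notebook.

```python
import random

def unique_ids(n, low=10_000_000_000, high=99_999_999_999, seed=None):
    """Return n distinct integers in [low, high] (both ends inclusive)."""
    rng = random.Random(seed)  # private generator, global random state is untouched
    # random.sample draws without replacement, so no ID can repeat
    return rng.sample(range(low, high + 1), n)

example_ids = unique_ids(50, seed=0)
print(len(example_ids), len(set(example_ids)))  # 50 50
```

The result is a plain list, which `pd.DataFrame` accepts in the same way as the NumPy array above.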

In [29]:
client_is_admin = np.random.randint(low = 0, high = 2, size = 50)
client_is_admin

array([1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0,
       1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       1, 0, 0, 0, 1, 0])

In [30]:
client_created_at = []
for _ in range(50):
    client_created_at.append(fake.date_time_between_dates(datetime_start = dt.datetime(2019,1,1), datetime_end = dt.datetime(2021,3,1)))

In [31]:
# `updated_at` must come at or after `created_at`
client_updated_at = []
for date_created in client_created_at:
    client_updated_at.append(fake.date_time_between_dates(datetime_start = date_created, datetime_end = dt.datetime(2021,3,1)))

In [32]:
# let's make all last logins *after* updated_at
client_last_login = []
for date_updated in client_updated_at:
    client_last_login.append(fake.date_time_between_dates(datetime_start = date_updated, datetime_end = dt.datetime(2021,3,1)))

In [33]:
clients_df = pd.DataFrame(data = {
    "id": client_id, 
    "is_admin": client_is_admin, 
    "created_at": client_created_at, 
    "updated_at": client_updated_at, 
    "last_login": client_last_login
})

clients_df.head()

| | id | is_admin | created_at | updated_at | last_login |
|---|---|---|---|---|---|
| 0 | 53068371121 | 1 | 2020-09-06 02:36:26 | 2020-11-27 23:23:58 | 2020-11-28 01:11:11 |
| 1 | 55856732757 | 0 | 2019-07-19 04:15:58 | 2020-12-09 01:23:06 | 2020-12-24 09:24:34 |
| 2 | 53580875532 | 1 | 2020-12-31 05:13:50 | 2021-01-27 20:07:30 | 2021-02-18 16:48:34 |
| 3 | 41393188944 | 0 | 2020-07-08 05:14:42 | 2020-11-05 02:06:55 | 2020-12-28 00:57:57 |
| 4 | 61862458655 | 1 | 2020-06-30 20:41:33 | 2020-11-22 07:59:46 | 2020-12-18 16:20:36 |
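
Since the datetime ordering is the trickiest constraint in this table, it is worth verifying it on the finished frame. The sketch below assumes the column names used above; `check_datetime_order` is a hypothetical helper, and the two sample rows are copied from the first two rows of the output above.

```python
import pandas as pd

def check_datetime_order(df):
    """Return True if updated_at and last_login never precede created_at."""
    updated_ok = (df["updated_at"] >= df["created_at"]).all()
    login_ok = (df["last_login"] >= df["created_at"]).all()
    return bool(updated_ok and login_ok)

# Two rows taken from the clients_df.head() output
sample = pd.DataFrame({
    "created_at": pd.to_datetime(["2020-09-06 02:36:26", "2019-07-19 04:15:58"]),
    "updated_at": pd.to_datetime(["2020-11-27 23:23:58", "2020-12-09 01:23:06"]),
    "last_login": pd.to_datetime(["2020-11-28 01:11:11", "2020-12-24 09:24:34"]),
})
print(check_datetime_order(sample))  # True
```

In the notebook, the same check could be run as `check_datetime_order(clients_df)`.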


***
## organizations

* id INT(11)
* client_id INT(11)
* name VARCHAR(255)
* website VARCHAR(255)
* description MEDIUMTEXT
* created_at DATETIME
* updated_at DATETIME
* slug VARCHAR(255)
* searchable_by_organizations VARCHAR(255)
* is_searchable TINYINT(1)
* region VARCHAR(255)
* lat FLOAT
* lng FLOAT
* last_verified_at DATETIME
* marked_deleted TINYINT(1)
* deleter_id INT(11)
* is_closed TINYINT(1)

In [38]:
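# dtype = np.int64 avoids an out-of-bounds error where NumPy's default integer is 32-bit
org_id = np.random.randint(low = 10_000_000_000, high = 99_999_999_999, size = 50, dtype = np.int64)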
org_id

array([76327319801, 78137121915, 84507898569, 22722476147, 95965357062,
       19864305342, 39673701505, 44734027201, 19301973909, 73246013987,
       99084709007, 13976104711, 72911050290, 31385758129, 87084250879,
       71113519379, 69603878938, 51493858704, 75512547632, 71642210292,
       41086704619, 47999295923, 39264351350, 35767394379, 84345654591,
       76619893204, 41632819120, 40441536909, 47385103929, 98227539971,
       95583944326, 20542405451, 82452590146, 35175178253, 14471430655,
       32835652572, 44440361480, 28306186016, 81840431545, 43669106666,
       26542744947, 92457716673, 86770472243, 19722145381, 91725481147,
       19954489250, 28965888958, 40941740635, 62816989785, 87353385757])

In [39]:
org_client_id = client_id
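
Reusing `client_id` directly gives every client exactly one organization, in the same row order. If the real data allows one client to own several organizations (an assumption, not something the schema list confirms), the foreign key can instead be sampled with replacement. A minimal sketch, with `known_client_ids` and `sampled_client_ids` as illustrative names:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

# Three client IDs copied from the `client_id` output above
known_client_ids = np.array([53068371121, 55856732757, 53580875532], dtype=np.int64)

# Draw with replacement: one client may own several organizations
sampled_client_ids = rng.choice(known_client_ids, size=10, replace=True)

# Referential integrity: every foreign key must exist in the clients table
all_valid = bool(np.isin(sampled_client_ids, known_client_ids).all())
print(all_valid)  # True
```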

In [40]:
org_name = []
for _ in range(50):
    org_name.append(fake.company())

In [41]:
org_website = []
for _ in range(50):
    org_website.append(fake.domain_name())

In [42]:
org_description = []
for _ in range(50):
    org_description.append(fake.paragraph())

In [43]:
org_created_at = []
for _ in range(50):
    org_created_at.append(fake.date_time_between_dates(datetime_start = dt.datetime(2019,1,1), datetime_end = dt.datetime(2021,3,1)))

In [44]:
# `updated_at` must come at or after the organization's own `created_at`
org_updated_at = []
for date_created in org_created_at:
    org_updated_at.append(fake.date_time_between_dates(datetime_start = date_created, datetime_end = dt.datetime(2021,3,1)))

In [47]:
org_slug = [name.replace(" ", "_").replace(",", "").lower() for name in org_name]
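
The chained `replace` calls handle spaces and commas, but company names can contain other punctuation, such as hyphens. A regex-based variant, sketched below with a hypothetical `slugify` helper, collapses every run of non-alphanumeric characters into a single underscore. Note that this changes behavior for hyphenated names, which the original expression leaves as they are.

```python
import re

def slugify(name):
    """Lowercase the name and replace each run of other characters with '_'."""
    lowered = name.lower()
    slug = re.sub(r"[^a-z0-9]+", "_", lowered)
    return slug.strip("_")  # drop any leading or trailing separator

print(slugify("Moore, Murray and Carpenter"))  # moore_murray_and_carpenter
print(slugify("Kent Inc"))                     # kent_inc
```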

In [49]:
org_searchable_by_organizations = []
for _ in range(50):
    org_searchable_by_organizations.append(fake.paragraph())

In [51]:
org_is_searchable = np.random.randint(low = 0, high = 2, size = 50)

In [54]:
org_region = ["Los Angeles"] * 50

In [55]:
org_lat = []
org_lng = []
for _ in range(50):
    org_lat.append(fake.latitude())
    org_lng.append(fake.longitude())
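
`fake.latitude()` and `fake.longitude()` return coordinates from anywhere in the world, which does not match a `region` of "Los Angeles". If plausible local coordinates matter, one option is to draw uniformly from a bounding box. The box below is a rough approximation of the city's extent and should be treated as an assumed value, not an authoritative boundary.

```python
import numpy as np

# Rough bounding box for the city of Los Angeles (approximate, assumed values)
LA_LAT_RANGE = (33.70, 34.34)
LA_LNG_RANGE = (-118.67, -118.16)

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
la_lat = rng.uniform(*LA_LAT_RANGE, size=50)
la_lng = rng.uniform(*LA_LNG_RANGE, size=50)

# Every point falls inside the box
print(la_lat.min() >= 33.70, la_lng.max() <= -118.16)  # True True
```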

In [57]:
# last verified - in the past year
org_last_verified = []
for _ in range(50):
    org_last_verified.append(fake.date_time_between_dates(datetime_start = dt.datetime(2020,3,1), datetime_end = dt.datetime(2021,3,1)))

In [58]:
org_marked_deleted = np.random.randint(low = 0, high = 2, size = 50)

In [60]:
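# dtype = np.int64 avoids an out-of-bounds error where NumPy's default integer is 32-bit
org_deleter_id = np.random.randint(low = 10_000_000_000, high = 99_999_999_999, size = 50, dtype = np.int64)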
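
As generated here, every organization gets a `deleter_id`, including those with `marked_deleted` set to 0. If the real column is only populated for deleted rows (an assumption about the schema's semantics), the dummy data can mirror that with pandas' nullable `Int64` dtype. The names `candidate_ids` and `deleter_id` below are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
marked_deleted = rng.integers(low=0, high=2, size=50)
candidate_ids = rng.integers(
    low=10_000_000_000, high=100_000_000_000, size=50, dtype=np.int64
)

# Keep an ID only where the row is marked deleted; otherwise store a missing value
deleter_id = pd.array(
    [int(i) if flag == 1 else None for i, flag in zip(candidate_ids, marked_deleted)],
    dtype="Int64",  # nullable integer dtype, so IDs stay integers next to missing values
)
print(deleter_id[:5])
```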

In [59]:
org_is_closed = np.random.randint(low = 0, high = 2, size = 50)

In [62]:
organizations_df = pd.DataFrame(data = {
    "id": org_id,
    "client_id": org_client_id,
    "name": org_name,
    "website": org_website,
    "description": org_description,
    "created_at": org_created_at,
    "updated_at": org_updated_at,
    "slug": org_slug,
    "searchable_by_organizations": org_searchable_by_organizations,
    "is_searchable": org_is_searchable,
    "region": org_region,
    "lat": org_lat,
    "lng": org_lng,
    "last_verified_at": org_last_verified,
    "marked_deleted": org_marked_deleted,
    "deleter_id": org_deleter_id,
    "is_closed": org_is_closed,
})

organizations_df.head()

| | id | client_id | name | website | description | created_at | updated_at | slug | searchable_by_organizations | is_searchable | region | lat | lng | last_verified_at | marked_deleted | deleter_id | is_closed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 76327319801 | 53068371121 | Klein LLC | jenkins.com | Prepare move role culture. Campaign power bag ... | 2019-03-15 02:39:25 | 2021-01-05 19:49:03 | klein_llc | Turn federal discussion although air level coa... | 0 | Los Angeles | 40.897703 | 178.944637 | 2020-05-13 21:49:08 | 0 | 41984010408 | 0 |
| 1 | 78137121915 | 55856732757 | Moore, Murray and Carpenter | stevenson.com | Next lead west air trial end. Know real act ot... | 2019-03-16 09:35:51 | 2020-10-20 08:08:33 | moore_murray_and_carpenter | Parent bring bag provide picture their alone. ... | 1 | Los Angeles | 82.811515 | -51.115769 | 2020-06-12 18:03:58 | 1 | 62259441723 | 1 |
| 2 | 84507898569 | 53580875532 | Cook, Massey and Ross | hamilton.org | Ok imagine since spring whom eight building. R... | 2019-07-03 01:26:21 | 2021-02-09 20:05:04 | cook_massey_and_ross | Seven report forget conference office analysis... | 0 | Los Angeles | 81.5554735 | 55.015759 | 2021-02-15 17:45:23 | 0 | 11147866092 | 1 |
| 3 | 22722476147 | 41393188944 | Tucker, Brown and Martin | williams.com | Finally culture fight answer rate left include... | 2019-08-15 08:06:14 | 2020-12-27 15:38:39 | tucker_brown_and_martin | Page general evening popular ago door whatever... | 1 | Los Angeles | 70.7408555 | 40.202486 | 2021-01-30 16:09:28 | 0 | 63584434937 | 0 |
| 4 | 95965357062 | 61862458655 | Kent Inc | hughes.com | Its pass home. Chance we new talk help. Econom... | 2019-05-22 17:12:09 | 2020-08-01 17:23:18 | kent_inc | Occur computer sure. Establish per can. | 0 | Los Angeles | 0.2406745 | 77.319157 | 2020-12-12 07:34:22 | 1 | 17830292638 | 1 |


***
## locations

* id INT(11)
* locatable_id INT(11)
* locatable_type VARCHAR(255)
* name VARCHAR(255)
* address VARCHAR(255)
* unit VARCHAR(255)
* city VARCHAR(255)
* state VARCHAR(255)
* zip_code VARCHAR(255)
* is_primary TINYINT(1)
* lat FLOAT
* long FLOAT
* created_at DATETIME
* updated_at DATETIME
* show_on_organization TINYINT(1)

In [63]:
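# dtype = np.int64 avoids an out-of-bounds error where NumPy's default integer is 32-bit
locations_id = np.random.randint(low = 10_000_000_000, high = 99_999_999_999, size = 50, dtype = np.int64)
locations_locatable_id = np.random.randint(low = 10_000_000_000, high = 99_999_999_999, size = 50, dtype = np.int64)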

In [65]:
# We can create multiple lists at the same time to save space/time.
# `locatable_type` is the VARCHAR column; `locatable_id` (INT) was generated in the previous cell.
locations_locatable_type = []
locations_name = []
for _ in range(50):
    locations_locatable_type.append(fake.sentence())
    locations_name.append(fake.company())

In [87]:
# we can split an address into each of its elements
locations_address = []
locations_unit = []
locations_city = []
locations_state = []
locations_zip_code = []
for _ in range(50):
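    street_address = fake.street_address()
    # Some generated addresses end in "Suite ###" or "Apt. ###";
    # keep only the unit number and strip the surrounding whitespace
    if "Suite" in street_address:
        unit = street_address.split("Suite")[-1].strip()
    elif "Apt." in street_address:
        unit = street_address.split("Apt.")[-1].strip()
    else:
        unit = ""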
    
    locations_address.append(street_address)
    locations_unit.append(unit)
    locations_city.append("Los Angeles")
    locations_state.append("CA")
    locations_zip_code.append(fake.postalcode_in_state(state_abbr = "CA"))
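
The loop above extracts the unit but leaves it inside `address` as well. If the two columns should not overlap, a regular expression can separate the street part from the unit in one step. This is a sketch under the assumption that units always appear at the end as "Suite ..." or "Apt. ..."; `split_unit` is a hypothetical helper and the sample addresses are made up for illustration.

```python
import re

# Street part, then a trailing "Suite <x>" or "Apt. <x>" unit
UNIT_PATTERN = re.compile(r"^(?P<street>.*?)\s+(?P<unit>(?:Suite|Apt\.)\s+\S+)$")

def split_unit(street_address):
    """Return (street, unit); unit is '' when the address has no unit suffix."""
    match = UNIT_PATTERN.match(street_address)
    if match is None:
        return street_address, ""
    return match.group("street"), match.group("unit")

print(split_unit("123 Main Street Suite 456"))  # ('123 Main Street', 'Suite 456')
print(split_unit("789 Oak Avenue"))             # ('789 Oak Avenue', '')
```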