In [1]:
"""
There are six files in all: train.csv, test.csv, users.csv, 
user_friends.csv, events.csv, and event_attendees.csv.

train.csv has six columns:  user, event, invited, timestamp, 
interested, and not_interested.  test.csv contains the same 
columns as train.csv, except for interested and not_interested. 
Each row corresponds to an event that was shown to a user in 
our application.  event is an id identifying an event in our system.  
user is an id representing a user in our system.  invited is a 
binary variable indicating whether the user has been invited to 
the event. timestamp is an ISO-8601 UTC time string representing 
the approximate time (+/- 2 hours) when the user saw the event in 
our application. interested is a binary variable indicating whether 
a user clicked on the "Interested" button for this event; it is 1 
if the user clicked Interested and 0 if the user did not click the 
button.  Similarly, not_interested is a binary variable indicating 
whether a user clicked on the "Not Interested" button for this event; 
it is 1 if the user clicked the button and 0 if not.  It is possible 
that the user saw an event and clicked neither Interested nor Not 
Interested, and hence there are rows that contain 0,0 as values for 
interested,not_interested.

users.csv contains demographic data about some of our users 
(including all of the users appearing in the train and test files), 
and it has the following columns: user_id, locale, birthyear, 
gender, joinedAt, location, and timezone. user_id is the id of 
the user in our system.  locale is a string representing the 
user's locale, which should be of the form language_territory. 
birthyear is a 4-digit integer representing the year when the user 
was born. gender is either male or female, depending on the user's 
gender.  joinedAt is an ISO-8601 UTC time string representing when 
the user first used our application.  location is a string 
representing the user's location (if known).  timezone is a 
signed integer representing the user's UTC offset (in minutes).

user_friends.csv contains social data about each user, and contains 
two columns:  user and friends.  user is the user's id in our system, 
and friends is a space-delimited list of the user's friends' ids.

events.csv contains data about events in our system, and has 110 
columns.  The first nine columns are event_id, user_id, start_time, 
city, state, zip, country, lat, and lng.  event_id is the id of 
the event, and user_id is the id of the user who created the event.  
city, state, zip, and country represent more details about the 
location of the venue (if known).  lat and lng are floats 
representing the latitude and longitude coordinates of the venue, 
rounded to three decimal places.  start_time is the ISO-8601 UTC 
time string representing when the event is scheduled to begin.  
The last 101 columns require a bit more explanation; first, we 
determined the 100 most common word stems (obtained via Porter 
Stemming) occurring in the name or description of a large random 
subset of our events.  The last 101 columns are count_1, count_2, 
..., count_100, count_other, where count_N is an integer representing 
the number of times the Nth most common word stem appears in the 
name or description of this event.  count_other is a count of the 
rest of the words whose stem wasn't one of the 100 most common stems.

event_attendees.csv contains information about which users attended 
various events, and has the following columns: event_id, yes, maybe, 
invited, and no. event_id identifies the event. yes, maybe, invited, 
and no are space-delimited lists of user ids representing users who 
indicated that they were going, maybe going, invited to, or not going 
to the event.
"""

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

In [2]:
train_df = pd.read_csv('./data/30Nov2020/train.csv')
test_df = pd.read_csv('./data/30Nov2020/test.csv')
users_df = pd.read_csv('./data/30Nov2020/users.csv')
user_friends_df = pd.read_csv('./data/30Nov2020/user_friends.csv')
events_df = pd.read_csv('./data/30Nov2020/events.csv')
event_attendees_df = pd.read_csv('./data/30Nov2020/event_attendees.csv')
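
The description says timestamp is an ISO-8601 UTC string, so it may be convenient to parse it while reading. The following is a minimal sketch on a hypothetical in-memory stand-in for train.csv (the rows are invented for illustration); `parse_dates` is a standard `pd.read_csv` parameter, but whether parsing at load time suits the real files is an assumption.

```python
import io
import pandas as pd

# Hypothetical stand-in for train.csv; the rows are invented for illustration.
csv_text = """user,event,invited,timestamp,interested,not_interested
1,10,0,2012-11-28 04:36:46.719000+00:00,0,0
2,11,1,2012-10-14 07:09:24.693000+00:00,1,0
"""

# parse_dates converts the listed columns to datetimes while reading.
toy_train = pd.read_csv(io.StringIO(csv_text), parse_dates=["timestamp"])
print(toy_train.dtypes)
```

With the real file, the same argument could be passed to the `pd.read_csv('./data/30Nov2020/train.csv')` call.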

In [3]:
train_df.sample(3)

Unnamed: 0,user,event,invited,timestamp,interested,not_interested
10436,3005728385,4066934337,0,2012-11-28 04:36:46.719000+00:00,0,0
11742,3338812734,2321179843,0,2012-10-14 07:09:24.693000+00:00,0,0
915,254306319,2678620069,0,2012-12-04 00:41:22.231000+00:00,0,0


In [4]:
train_df.shape

(15398, 6)
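
The description notes that a row can carry 0,0 for interested,not_interested. A quick way to see how the label combinations are distributed is to count them. This is a sketch on a hypothetical toy frame with invented values, not the real train_df.

```python
import pandas as pd

# Hypothetical toy rows using the label columns of train.csv (values invented).
toy_train = pd.DataFrame({
    "user": [1, 1, 2, 3],
    "event": [10, 11, 10, 12],
    "interested": [1, 0, 0, 0],
    "not_interested": [0, 1, 0, 0],
})

# Count each (interested, not_interested) combination, including (0, 0).
label_counts = toy_train.groupby(["interested", "not_interested"]).size()
print(label_counts)
```

Applying the same groupby to train_df would show how many rows received no click at all.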

In [5]:
test_df.sample(3)

Unnamed: 0,user,event,invited,timestamp
7133,2991097972,3420008322,0,2012-10-19 00:33:29.001000+00:00
3049,1227878272,2374444123,0,2012-10-30 05:17:13.159000+00:00
1286,547302435,440098296,0,2012-10-28 03:44:37.293000+00:00


In [6]:
test_df.shape

(10237, 4)

In [7]:
users_df.sample(3)

Unnamed: 0,user_id,locale,birthyear,gender,joinedAt,location,timezone
3579,3149755720,id_ID,1987,female,2012-10-25T11:04:51.372Z,Medan Indonesia,420.0
1384,1566352766,id_ID,1993,male,2012-09-25T00:31:57.152Z,Medan Indonesia,420.0
5840,4143773682,en_US,1985,male,2012-11-08T23:28:08.647Z,Phnom Penh,420.0


In [8]:
users_df.shape

(38209, 7)
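
Several users.csv columns are encoded values: locale has the form language_territory, timezone is a UTC offset in minutes, and joinedAt is an ISO-8601 string. The sketch below decodes them on a hypothetical toy frame modelled on the sample rows above; the derived column names (`language`, `territory`, `utc_offset_hours`) are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy rows shaped like users.csv (ids invented).
toy_users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "locale": ["id_ID", "id_ID", "en_US"],
    "joinedAt": [
        "2012-10-25T11:04:51.372Z",
        "2012-09-25T00:31:57.152Z",
        "2012-11-08T23:28:08.647Z",
    ],
    "timezone": [420.0, 420.0, 420.0],
})

# Split "language_territory" into its two parts.
parts = toy_users["locale"].str.split("_", n=1, expand=True)
toy_users["language"] = parts[0]
toy_users["territory"] = parts[1]

# The offset is stored in minutes; convert to hours.
toy_users["utc_offset_hours"] = toy_users["timezone"] / 60

# Parse the ISO-8601 strings into timezone-aware timestamps.
toy_users["joinedAt"] = pd.to_datetime(toy_users["joinedAt"], utc=True)
print(toy_users)
```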

In [9]:
user_friends_df.sample(3)

Unnamed: 0,user,friends
10711,3743850608,2744225667 60476996 302756193 2498417136 80506...
23542,2946039837,3710022142 1360644655 370486744 793854333 1336...
26296,3226465762,3150782240 3637747987 3043246176 1628627566 33...


In [10]:
user_friends_df.shape

(38202, 2)
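
Because friends is a space-delimited string, it usually needs to be split before it can be joined against other tables. A minimal sketch, using a hypothetical toy frame, turns it into one row per (user, friend) pair; the possibility of a missing friends value is an assumption handled with `dropna`.

```python
import pandas as pd

# Hypothetical toy rows shaped like user_friends.csv (ids invented).
toy_friends = pd.DataFrame({
    "user": [1, 2, 3],
    "friends": ["2 3", "1", None],  # assumption: a user may have no friends listed
})

# Split each string into a list, then emit one row per list element.
edges = (
    toy_friends.assign(friend=toy_friends["friends"].str.split())
    .explode("friend")
    .dropna(subset=["friend"])
    .drop(columns="friends")
)
edges["friend"] = edges["friend"].astype("int64")

# Number of listed friends per user.
friend_counts = edges.groupby("user").size()
print(edges)
print(friend_counts)
```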

In [11]:
events_df.sample(3)

Unnamed: 0,event_id,user_id,start_time,city,state,zip,country,lat,lng,c_1,...,c_92,c_93,c_94,c_95,c_96,c_97,c_98,c_99,c_100,c_other
3103747,1717103660,4273148690,2012-08-19T03:30:00.000Z,Germantown,MD,,United States,39.177,-77.257,1,...,0,0,0,0,0,0,0,0,0,22
384248,150717235,2257006975,2012-10-20T01:00:00.003Z,,,,,,,1,...,0,0,0,1,0,0,1,0,0,27
2041947,2260075719,3293560262,2012-11-15T06:00:00.003Z,,,,,,,2,...,0,0,0,0,0,0,0,0,0,4


In [12]:
events_df.shape

(3137972, 110)
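
The sample output above shows the word-stem columns as c_1, ..., c_100, c_other (the description calls them count_N). One way to summarise them is to total the common-stem counts per event. This sketch uses a hypothetical toy frame with only two stem columns; `n_common`, `n_words`, and `common_share` are illustrative names.

```python
import pandas as pd

# Hypothetical toy rows with a reduced set of stem columns (values invented).
toy_events = pd.DataFrame({
    "event_id": [100, 101],
    "c_1": [1, 2],
    "c_2": [0, 3],
    "c_other": [22, 5],
})

# Select the per-stem columns by prefix, excluding the catch-all column.
stem_cols = [c for c in toy_events.columns
             if c.startswith("c_") and c != "c_other"]

# Words matching a common stem, total words, and the common-stem share.
toy_events["n_common"] = toy_events[stem_cols].sum(axis=1)
toy_events["n_words"] = toy_events["n_common"] + toy_events["c_other"]
toy_events["common_share"] = toy_events["n_common"] / toy_events["n_words"]
print(toy_events)
```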

In [13]:
event_attendees_df.sample(3)

Unnamed: 0,event,yes,maybe,invited,no
22950,1977406305,99240058 2545371972,,2138442061 766167504 1677343422 2159814197,
2887,1180440770,4132048871 884148328 2612802385 3378716125 100...,2060527614 3778434404 3462370471 815532795 106...,490203760 3156449290 143232238 510542959 26692...,1944656963 1191949470 4037579500 3478748057 16...
10879,361060814,2365261832 1800193892 2913057748 1128043319 88...,2654861659 1731395369 1233195359 212561420 220...,3026117162 1170301617 483420821 2109104539 211...,223124516 1148188143


In [14]:
event_attendees_df.shape

(24144, 5)
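
The yes, maybe, invited, and no columns are space-delimited id lists, and the sample above shows that some are empty. A simple derived feature is the number of ids in each list. The sketch below computes it on a hypothetical toy frame (the id column is named event, as in the sample output); the `n_*` column names are assumptions.

```python
import pandas as pd

# Hypothetical toy rows shaped like event_attendees.csv (ids invented).
toy_attendees = pd.DataFrame({
    "event": [100, 101],
    "yes": ["1 2", None],
    "maybe": [None, "3"],
    "invited": ["1 2 3 4", "3"],
    "no": [None, None],
})

# Treat a missing list as empty, split on whitespace, and count the ids.
for col in ["yes", "maybe", "invited", "no"]:
    toy_attendees[f"n_{col}"] = (
        toy_attendees[col].fillna("").str.split().str.len()
    )
print(toy_attendees[["event", "n_yes", "n_maybe", "n_invited", "n_no"]])
```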