## EDA
### Users Table
- `id`: Unique identifier for each user (numeric)
- `created_at`: User creation timestamp (ISO date string)
- `attribution_source`: User acquisition source (tiktok, instagram, or organic)
- `country`: User's country (US, TR, or NL)
- `name`: User's name

### User Events Table
- `id`: Unique event identifier (numeric)
- `created_at`: Event timestamp (ISO date string)
- `user_id`: Reference to users table (numeric)
- `event_name`: Type of event (app_install, trial_started, trial_cancelled, subscription_started, subscription_renewed, subscription_cancelled)
- `amount_usd`: Transaction amount in USD (numeric)
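
The column lists above can be checked against the database itself. The following is a minimal sketch, assuming a SQLite schema that mirrors those lists: the in-memory database and the `CREATE TABLE` statements are illustrative stand-ins, not the actual DDL of `papcorns.sqlite`, and `column_names` is a hypothetical helper.

```python
import sqlite3

# Hypothetical DDL mirroring the documented columns; the real file may differ.
demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    created_at TEXT,
    attribution_source TEXT,
    country TEXT,
    name TEXT
);
CREATE TABLE user_events (
    id INTEGER PRIMARY KEY,
    created_at TEXT,
    user_id INTEGER REFERENCES users(id),
    event_name TEXT,
    amount_usd REAL
);
""")

def column_names(connection, table):
    """Return the column names SQLite reports for `table`."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) rows.
    rows = connection.execute(f"PRAGMA table_info({table})").fetchall()
    return [row[1] for row in rows]

users_columns = column_names(demo, "users")
events_columns = column_names(demo, "user_events")
print(users_columns)
print(events_columns)
```

Calling `column_names(conn, "users")` on the real connection would list the columns actually present in the file.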

In [92]:
import pandas as pd
import numpy as np
import sqlite3
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

In [93]:
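try:
    conn = sqlite3.connect('papcorns.sqlite')
except sqlite3.Error as e:
    # Report connection problems instead of failing silently
    print(f"Could not connect to the database: {e}")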

In [94]:
# Preview the first five rows of the users table
users_df = pd.read_sql_query("SELECT * FROM users LIMIT 5;", conn)
print("Users table preview:")
display(users_df)

# Preview the first five rows of the events table
events_df = pd.read_sql_query("SELECT * FROM user_events LIMIT 5;", conn)
print("\nUser events table preview:")
display(events_df)

Users table preview:


Unnamed: 0,id,created_at,attribution_source,country,name
0,1,2024-05-07T00:00:00,instagram,US,Eve Brown
1,2,2024-10-12T00:00:00,instagram,NL,Frank Moore
2,3,2024-10-15T00:00:00,tiktok,TR,Ivy Anderson
3,4,2024-08-28T00:00:00,tiktok,TR,Alice Brown
4,5,2024-04-03T00:00:00,organic,NL,Bob Moore



User events table preview:


Unnamed: 0,id,created_at,user_id,event_name,amount_usd
0,1,2024-05-07T00:00:00,1,app_install,
1,2,2024-05-12T00:00:00,1,trial_started,
2,3,2024-05-24T00:00:00,1,trial_cancelled,
3,4,2024-10-12T00:00:00,2,app_install,
4,5,2024-10-13T00:00:00,2,trial_started,


In [95]:
users_df = pd.read_sql_query("SELECT * FROM users", conn)

In [96]:
events_df = pd.read_sql_query("SELECT * FROM user_events", conn)

In [97]:
# `users_df.describe` without parentheses only returns the bound method, so print the frame itself
print(users_df)

        id           created_at attribution_source country           name
0        1  2024-05-07T00:00:00          instagram      US      Eve Brown
1        2  2024-10-12T00:00:00          instagram      NL    Frank Moore
2        3  2024-10-15T00:00:00             tiktok      TR   Ivy Anderson
3        4  2024-08-28T00:00:00             tiktok      TR    Alice Brown
4        5  2024-04-03T00:00:00            organic      NL      Bob Moore
...    ...                  ...                ...     ...            ...
997    998  2025-02-01T00:00:00          instagram      TR      Bob Davis
998    999  2024-12-24T00:00:00            organic      NL  Charlie Davis
999   1000  2025-02-13T00:00:00            organic      NL  Jack Anderson
1000  1001  2025-02-16T00:00:00          instagram      US    Bruce Wayne
1001  1002  2025-02-16T00:00:00            organic      TR     Clark Kent

[1002 rows x 5 columns]

In [98]:
# `events_df.describe` without parentheses only returns the bound method, so print the frame itself
print(events_df)

        id           created_at  user_id            event_name  amount_usd
0        1  2024-05-07T00:00:00        1           app_install         NaN
1        2  2024-05-12T00:00:00        1         trial_started         NaN
2        3  2024-05-24T00:00:00        1       trial_cancelled         NaN
3        4  2024-10-12T00:00:00        2           app_install         NaN
4        5  2024-10-13T00:00:00        2         trial_started         NaN
...    ...                  ...      ...                   ...         ...
3481  3482  2025-02-25T00:00:00     1000       trial_cancelled         NaN
3482  3483  2025-02-25T00:00:00     1001           app_install         NaN
3483  3484  2025-02-25T00:00:00     1001         trial_started         NaN
3484  3485  2025-02-25T00:00:00     1001  subscription_started        9.99
3485  3486  2025-02-25T00:00:00     1002           app_install         NaN

[3486 rows x 5 columns]

Let's start by checking for missing values.

In [99]:
users_df.isna().any()

id                    False
created_at            False
attribution_source    False
country               False
name                  False
dtype: bool

In [100]:
events_df.isna().any()

id            False
created_at    False
user_id       False
event_name    False
amount_usd     True
dtype: bool

Let's count the missing values in each column of `events_df`.

In [101]:
events_df.isna().sum()

id               0
created_at       0
user_id          0
event_name       0
amount_usd    2255
dtype: int64

It turns out that 2255 of the 3486 rows have no value in the `amount_usd` column. This also means that 3486 - 2255 = 1231 events carry a transaction amount.
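
The same count can be read directly with `notna()` rather than by subtraction. Here is a small sketch on a toy frame: the five rows are made up for illustration, and in the notebook `events_df` would take the place of `toy_events`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for events_df; values are illustrative only.
toy_events = pd.DataFrame({
    "event_name": ["app_install", "trial_started", "subscription_started",
                   "subscription_renewed", "trial_cancelled"],
    "amount_usd": [np.nan, np.nan, 9.99, 9.99, np.nan],
})

missing = int(toy_events["amount_usd"].isna().sum())   # rows without an amount
present = int(toy_events["amount_usd"].notna().sum())  # rows with an amount

# Missing and present counts always partition the rows.
assert missing + present == len(toy_events)
print(missing, present)
```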

In [102]:
events_df["event_name"].value_counts()

event_name
app_install               1002
subscription_renewed       750
trial_started              682
subscription_started       481
subscription_cancelled     370
trial_cancelled            201
Name: count, dtype: int64

Since `subscription_started` (481) + `subscription_renewed` (750) = 1231, every non-null amount could belong to a subscription start or renewal, and all other events would be the ones that show up as `NaN`. Let's verify this by checking whether any of these paid events has a missing `amount_usd`.

In [103]:
# Select the events that should carry a payment
subscription_events = events_df[events_df['event_name'].isin(['subscription_started', 'subscription_renewed'])]
# Of those, keep the events with no recorded amount
missing_amount = subscription_events[subscription_events['amount_usd'].isna()]
print(f"Total subscription events: {len(subscription_events)}")
print(f"Subscription events with missing amount: {len(missing_amount)}")

Total subscription events: 1231
Subscription events with missing amount: 0
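
This check covers one direction only: paid events always carry an amount. The converse, that the other event types never carry one, could be checked by computing the share of missing amounts per event type. This is a sketch on a toy frame, with rows invented for illustration; in the notebook `events_df` would replace `toy_events`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for events_df; values are illustrative only.
toy_events = pd.DataFrame({
    "event_name": ["app_install", "trial_started", "subscription_started",
                   "subscription_renewed", "subscription_cancelled", "trial_cancelled"],
    "amount_usd": [np.nan, np.nan, 4.99, 4.99, np.nan, np.nan],
})

# Share of rows with a missing amount per event type:
# 0.0 means the event always has an amount, 1.0 means it never does.
missing_share = (
    toy_events["amount_usd"].isna()
    .groupby(toy_events["event_name"])
    .mean()
)
print(missing_share)
```

If every share is exactly 0.0 or 1.0, the missing values are fully explained by the event type.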


In [104]:
events_df["amount_usd"].nunique()

3

In [105]:
unique_values = events_df['amount_usd'].unique()
print(unique_values)

[ nan 8.99 4.99 9.99]


The non-null amounts take only three distinct values (4.99, 8.99 and 9.99), which suggests three subscription price tiers.
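
Three distinct prices do not by themselves show how the tiers are assigned. One hypothesis worth checking is that the price depends on the user's country. Below is a sketch of that check with a merge and `pd.crosstab`; the toy users, events, and the price-to-country pairing are all invented for illustration and are not taken from the data.

```python
import pandas as pd

# Toy stand-ins for users_df and events_df; the pairing of prices to
# countries here is an assumption for demonstration, not a finding.
toy_users = pd.DataFrame({"id": [1, 2, 3], "country": ["US", "TR", "NL"]})
toy_events = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "event_name": ["subscription_started", "subscription_renewed",
                   "subscription_started", "subscription_started",
                   "subscription_renewed"],
    "amount_usd": [9.99, 9.99, 4.99, 8.99, 8.99],
})

# Keep only paid events, then attach each user's country.
paid = toy_events.dropna(subset=["amount_usd"])
merged = paid.merge(toy_users, left_on="user_id", right_on="id", how="left")

# Rows are countries, columns are prices, cells are event counts.
price_by_country = pd.crosstab(merged["country"], merged["amount_usd"])
print(price_by_country)
```

If each country row has a single non-zero column, the price is determined by country. A more spread-out table would point to tiers chosen some other way.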

## Tasks