## (1) Data Description:

## Dataset Overview

| File           | Purpose                                 | Rows (Observations) | Columns (Variables) |
| -------------- | --------------------------------------- | ------------------- | ------------------- |
| `sessions.csv` | Gameplay session logs per user          | **1535**            | **5**               |
| `players.csv`  | Player demographic & membership profile | **196**             | **7**               |

The two datasets appear to be linked through the shared **`hashedEmail`** column.

---

## Variable Summary

### **`sessions.csv` Variables**

| Variable              | Type (Inferred)                     | Description                        |
| --------------------- | ----------------------------------- | ---------------------------------- |
| `hashedEmail`         | Categorical (string)                | Anonymized unique user identifier  |
| `start_time`          | String (datetime, needs conversion) | Start time of the recorded session |
| `end_time`            | String (datetime, needs conversion) | End time of the recorded session   |
| `original_start_time` | Numeric (Unix timestamp, ms)        | Original system-logged start time  |
| `original_end_time`   | Numeric (Unix timestamp, ms)        | Original system-logged end time    |

---

### **`players.csv` Variables**

| Variable       | Type (Inferred)                                   | Description                                        |
| -------------- | ------------------------------------------------- | -------------------------------------------------- |
| `experience`   | Ordinal categorical (Beginner → Pro)              | Skill level                                        |
| `subscribe`    | Boolean categorical                               | Whether user is subscribed (True/False)            |
| `hashedEmail`  | Categorical (key)                                 | Unique encrypted identifier (matches sessions.csv) |
| `played_hours` | Numeric (continuous)                              | Total lifetime playtime                            |
| `name`         | Categorical                                       | Player alias                                       |
| `gender`       | Categorical                                       | Gender identification                              |
| `Age`          | Numeric (continuous)                              | Age in years                                       |

---

## Summary Statistics

### **Numeric Variables – Sessions**

| Metric  | `original_start_time` | `original_end_time` |
| ------- | --------------------- | ------------------- |
| Count   | 1535                  | 1533                |
| Mean    | 1.71920e+12           | 1.71920e+12         |
| Std Dev | 3.56e+09              | 3.55e+09            |
| Min     | 1.71240e+12           | 1.71240e+12         |
| 25%     | 1.71624e+12           | 1.71624e+12         |
| Median  | 1.71920e+12           | 1.71918e+12         |
| 75%     | 1.72189e+12           | 1.72189e+12         |
| Max     | 1.72733e+12           | 1.72734e+12         |

**The `original_*` timestamps appear to be Unix epoch values in milliseconds and need conversion to datetimes before they can be interpreted.**
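
A minimal sketch of that conversion, assuming the values really are Unix epoch milliseconds in UTC. The three numbers below are made up to fall in the same range as the summary table, and the variable name `start_dt` is hypothetical.

```r
# Made-up millisecond timestamps in the range shown in the summary table.
original_start_time <- c(1.71240e12, 1.71920e12, 1.72733e12)

# Divide by 1000 to go from milliseconds to seconds, then convert the
# seconds since 1970-01-01 to a datetime. The UTC timezone is an assumption.
start_dt <- as.POSIXct(original_start_time / 1000,
                       origin = "1970-01-01", tz = "UTC")
start_dt
```

If the converted values land in a plausible range (here, the year 2024), that supports the millisecond interpretation; values in 1970 or far in the future would suggest a different unit.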

---

### **Numeric Variables – Players**

| Metric  | `played_hours` | `Age` |
| ------- | -------------- | ----- |
| Count   | 196            | 194   |
| Mean    | 6.19           | 21.14 |
| Std Dev | 17.83          | 7.39  |
| Min     | 0.00           | 9.00  |
| 25%     | 0.10           | 17.00 |
| Median  | 0.60           | 19.00 |
| 75%     | 2.30           | 22.75 |
| Max     | 223.10         | 58.00 |

---

## Data Quality Observations & Issues

### Observations

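* Both datasets share a key (`hashedEmail`) that connects them; it identifies one player per row in `players.csv` and repeats across rows in `sessions.csv`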
* No apparent duplicates in players
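
These two observations can be checked directly. The sketch below uses small made-up tables in which only the column name `hashedEmail` comes from the data; the object names are hypothetical.

```r
library(dplyr)

# Made-up stand-ins for players.csv and sessions.csv.
players <- tibble(
  hashedEmail  = c("a1", "b2", "c3"),
  played_hours = c(0.5, 2.3, 0.0)
)
sessions <- tibble(
  hashedEmail = c("a1", "a1", "b2")
)

# Number of hashedEmail values that occur more than once in players
# (0 means the key is unique there).
n_duplicate_keys <- players |>
  count(hashedEmail) |>
  filter(n > 1) |>
  nrow()

# Sessions whose hashedEmail has no match in players (ideally zero rows).
unmatched_sessions <- anti_join(sessions, players, by = "hashedEmail")

# Players who never appear in the session log.
players_without_sessions <- anti_join(players, sessions, by = "hashedEmail")
```

Players without any sessions are not necessarily an error, but they are worth counting before joining the two tables.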

### Issues

| Issue                                      | Explanation                                                                       |
| ------------------------------------------ | --------------------------------------------------------------------------------- |
| Datetime stored as text                    | `start_time` and `end_time` need conversion to proper datetime objects            |
| Some missing values                        | `end_time`, `original_end_time`, and `Age` are missing for some records           |
| Potential fake names                       | Every name is unique, which is unusual for real names                             |
| Unit and timezone of `original_*` unclear  | Values look like Unix milliseconds; likely need rescaling and an explicit timezone |
| No session duration field                  | Must be computed from the start and end times                                     |
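
A sketch of how the text datetimes, the missing values, and the missing duration field might be handled together. The rows are made up, and the `dd/mm/yyyy HH:MM` text format is an assumption that should be checked against the real file; `duration_min` and `missing_counts` are hypothetical names.

```r
library(dplyr)
library(lubridate)

# Made-up rows; the "dd/mm/yyyy HH:MM" format is an assumption.
sessions <- tibble(
  hashedEmail = c("a1", "a1", "b2"),
  start_time  = c("30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34"),
  end_time    = c("30/06/2024 18:24", "17/06/2024 23:46", NA)
)

sessions_clean <- sessions |>
  mutate(
    start_time = dmy_hm(start_time),  # text -> datetime (UTC by default)
    end_time   = dmy_hm(end_time),    # a missing end time stays NA
    # Derived metric that the raw file does not contain.
    duration_min = as.numeric(difftime(end_time, start_time, units = "mins"))
  )

# Number of missing values in each column.
missing_counts <- sessions_clean |>
  summarise(across(everything(), ~ sum(is.na(.x))))
```

Sessions with a missing end time get a missing duration, so they have to be dropped or handled explicitly before any duration-based summary.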

---

## Possible Unseen Issues

* Player accounts may be shared between multiple real humans
* Sessions could include idle time
* Mixed device/platform logging formats could influence timestamps
* Experience level might be self-reported and therefore **biased**
* No validation that `played_hours` equals the sum of session durations
* True identities are unknown, so demographics are **unverifiable**
* Possible time drift between the sources of `start_time` and `original_start_time`
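
The `played_hours` concern in the list above could be checked by comparing each player's reported total with the sum of their logged sessions. This is only a sketch on made-up values: `duration_hours` stands in for a session length already computed from the timestamps, and all object names are hypothetical.

```r
library(dplyr)

# Made-up values for illustration.
players <- tibble(
  hashedEmail  = c("a1", "b2", "c3"),
  played_hours = c(1.5, 2.0, 0.0)
)
sessions <- tibble(
  hashedEmail    = c("a1", "a1", "b2"),
  duration_hours = c(1.0, 0.5, 0.7)
)

# Total logged time per player.
session_totals <- sessions |>
  group_by(hashedEmail) |>
  summarise(session_hours = sum(duration_hours, na.rm = TRUE))

# Join onto players; players with no sessions get 0 logged hours.
comparison <- players |>
  left_join(session_totals, by = "hashedEmail") |>
  mutate(
    session_hours = coalesce(session_hours, 0),
    difference    = played_hours - session_hours
  )
comparison
```

Large values in `difference` would indicate that `played_hours` and the session log measure different things, or that sessions are missing.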

---

## How Data Were Likely Collected

Based on the structure, the data appear to have been collected as follows:

1. **Automated telemetry logging** from the game
2. A player demographic survey
3. Session logs stored passively with timestamps
4. A later export to CSV for analysis

## (2) Questions:

## Can "experience" predict "played hours" in players.csv?
Using linear regression, the `experience` variable is first converted to a numeric scale (for example, Beginner = 0, Amateur = 1, Regular = 2, Veteran = 3, Pro = 4) so that its ordering can be used mathematically. This numeric experience value is the predictor (X) in the regression model, and played hours is the response (Y). The regression fits a line of best fit through the data points by estimating an intercept and a slope that describe the relationship between experience and playtime. To wrangle the data, the first step is to read the CSV and drop the unneeded columns, keeping only `experience` and `played_hours`. Next, `experience` is converted to the numeric scale by assigning a number to each level. Lastly, because `played_hours` is strongly right-skewed (mean 6.19 versus median 0.60), a log transform of the hours played may be needed.
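
The wrangling and model steps described above can be sketched as follows. The rows are made up, the level order follows the example scale in the text, and base R's `lm()` is used here only as one simple way to fit the line. Because `played_hours` can be 0, the sketch uses `log1p()` (the log of 1 + x) rather than a plain log; that choice, like the names `experience_num` and `log_played_hours`, is an assumption.

```r
library(dplyr)

# Made-up rows for illustration only.
players <- tibble(
  experience   = c("Beginner", "Amateur", "Regular", "Veteran", "Pro", "Amateur"),
  played_hours = c(0.1, 0.6, 2.3, 0.0, 30.3, 1.0),
  Age          = c(17, 19, 22, 21, 9, 20)
)

# Level order taken from the example scale in the text.
levels_in_order <- c("Beginner", "Amateur", "Regular", "Veteran", "Pro")

players_model <- players |>
  select(experience, played_hours) |>   # keep only the two needed columns
  mutate(
    # Beginner = 0, Amateur = 1, ..., Pro = 4
    experience_num   = as.integer(factor(experience, levels = levels_in_order)) - 1L,
    # log(1 + hours) reduces skew and is defined when hours = 0
    log_played_hours = log1p(played_hours)
  )

# Fit the line of best fit: intercept and slope for experience_num.
fit <- lm(log_played_hours ~ experience_num, data = players_model)
coef(fit)
```

Coding the levels as 0 to 4 treats the gaps between adjacent levels as equal, which is a modelling assumption worth stating alongside any result.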

## (3) Exploratory Data Analysis and Visualization:

In [None]:
library(repr)
library(tidyverse)

In [None]:
# read the players data
df <- read_csv("https://raw.githubusercontent.com/ubc-danielX/dsci-100-2025w1-group-008-13/refs/heads/main/players.csv")
df
# find the mean of hours played and age
mean_played <- mean(df$played_hours, na.rm = TRUE)
mean_age <- mean(df$Age, na.rm = TRUE)
# the same means in one summary table, plus the number of players
mean_summary <- df |>
    summarise(
        mean_played_hours = mean(played_hours, na.rm = TRUE),
        mean_age = mean(Age, na.rm = TRUE),
        total_count = n()
    )
mean_played
mean_age
mean_summary

## (4) Methods and Plan: