## 1. Data Description

### 1.1 Context

This dataset contains **social media discussions** related to the viral Vietnamese indie game  
**“Tiệm Phở của Anh Hai” (Brother Hai’s Pho Restaurant)**, a student-made game that gained widespread  
attention across Facebook and TikTok in late 2025.

The data was collected from **public social media sources** using keyword-based crawling.  
Each record represents a **post, share, or mention** about the game, containing both textual content  
and metadata (e.g., engagement metrics, author info, platform type).

The dataset supports the exploration of **user engagement, content sentiment, and behavioral patterns**  
in how Vietnamese online communities discuss a trending local game.

---

### 1.2 Data Sources and Scope

- **Collection Period:** Early November 2025 (the week of the game’s viral peak).  
- **Platforms:** Primarily Facebook (`platform = 1`), TikTok (`platform = 9`) and YouTube (`platform = 7`).  
- **Language:** Vietnamese (UTF-8 encoded).  
- **Data Type:** Structured JSON; post text is nested in the `search_text` array (title at index 0, body at index 1).  
- **Record Unit:** Each entry represents a single *social media mention*.  
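
As a minimal sketch of how the records could be read, the snippet below parses a JSON array of mentions and attaches a readable platform label. It assumes the export is a single JSON array of objects; the `load_mentions` helper and the `platform_name` key are hypothetical names introduced here for illustration, and the sample records are made up.

```python
import json

# Platform codes as listed in this section; any other code is labeled "other".
PLATFORM_NAMES = {1: "Facebook", 7: "YouTube", 9: "TikTok"}


def load_mentions(raw_json: str) -> list[dict]:
    """Parse a JSON array of mention records (assumed layout) and label the platform."""
    records = json.loads(raw_json)
    for record in records:
        # `platform_name` is a hypothetical helper column, not part of the dataset.
        record["platform_name"] = PLATFORM_NAMES.get(record.get("platform"), "other")
    return records


# Made-up sample records, only to show the expected shape.
sample = (
    '[{"platform": 1, "search_text": ["Tiêu đề", "Nội dung"]},'
    ' {"platform": 9, "search_text": ["Chỉ có tiêu đề"]},'
    ' {"platform": 4, "search_text": []}]'
)
mentions = load_mentions(sample)
print([m["platform_name"] for m in mentions])
```

When reading from disk, opening the file with `encoding="utf-8"` keeps the Vietnamese diacritics intact.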

---

### 1.3 Selected Features

The dataset originally contained 25+ columns.  
After cleaning and selection, the following key fields are retained for analysis:

| Field | Description | Type | Example |
|-------|--------------|------|----------|
| `title` | Short caption extracted from `search_text[0]` | text | "CÓ GAME VIỆT VIRAL TRƯỚC CẢ GTA6!!!" |
| `text` | Main body of the post (`search_text[1]`) | text | "Brother Hai’s Pho Restaurant là đồ án tốt nghiệp..." |
| `platform` | Social media platform (1 = Facebook, 7 = YouTube, 9 = TikTok) | int | 1 |
| `domain` | Platform domain name | string | "facebook.com" |
| `source_type` | Source type (1 = User, 2 = Fanpage, 3 = Group) | int | 2 |
| `source_name` | Name of the posting account/page | string | "Fastcare - Hệ Thống Sửa Chữa Điện Thoại" |
| `identity_name` | Display name of the account | string | "Fastcare" |
| `identity_gender` | Gender code (0 = Unknown, 1 = Male, 2 = Female) | int | 0 |
| `identity_city` | City or region code (0 = Unknown) | int | 0 |
| `likes` | Number of likes | int | 254 |
| `shares` | Number of shares | int | 36 |
| `comments` | Number of comments | int | 28 |
| `views` | View count of the post | int | 5400 |
| `engagement_total` | Sum of `likes`, `shares`, and `comments` (`views` not included) | int | 318 |
| `created_date` | UTC timestamp when the post was published | datetime | 2025-11-04T07:00:04Z |
| `mention_type` | Type of mention (1 = post, 2 = comment, etc.) | int | 1 |
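
The derived fields in the table can be illustrated with a short sketch. It assumes each raw record carries a `search_text` list plus `likes`, `shares`, `comments`, and an ISO-style `created_date` string; the `flatten_mention` function is a hypothetical helper, not the pipeline actually used. The example values are the ones shown in the table.

```python
from datetime import datetime, timezone


def flatten_mention(raw: dict) -> dict:
    """Turn one raw record into selected fields (the raw layout is an assumption)."""
    parts = raw.get("search_text") or []
    likes = int(raw.get("likes") or 0)
    shares = int(raw.get("shares") or 0)
    comments = int(raw.get("comments") or 0)
    # Timestamps are assumed to look like 2025-11-04T07:00:04Z (UTC).
    created = datetime.strptime(raw["created_date"], "%Y-%m-%dT%H:%M:%SZ")
    return {
        "title": parts[0] if len(parts) > 0 else "",
        "text": parts[1] if len(parts) > 1 else "",
        "likes": likes,
        "shares": shares,
        "comments": comments,
        # Views are deliberately left out of the engagement sum.
        "engagement_total": likes + shares + comments,
        "created_date": created.replace(tzinfo=timezone.utc),
    }


raw_example = {
    "search_text": [
        "CÓ GAME VIỆT VIRAL TRƯỚC CẢ GTA6!!!",
        "Brother Hai's Pho Restaurant là đồ án tốt nghiệp...",
    ],
    "likes": 254,
    "shares": 36,
    "comments": 28,
    "created_date": "2025-11-04T07:00:04Z",
}
row = flatten_mention(raw_example)
print(row["engagement_total"])  # 318
```

Records with a missing or one-element `search_text` fall back to empty strings rather than raising an error.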

---

### 1.4 Analytical Focus

The selected attributes capture **three complementary perspectives** of social engagement:

1. **Content and Language**
   - `title`, `text`  
   → used to explore trending topics, keywords, and sentiment.

2. **Behavioral Engagement**
   - `likes`, `shares`, `comments`, `views`, `engagement_total`  
   → used to measure virality and audience reactions.

3. **Contextual & Demographic Information**
   - `platform`, `source_type`, `identity_gender`, `identity_city`, `created_date`  
   → used to identify *where*, *who*, and *when* content was generated.

---

### 1.5 Data Quality Notes

- The dataset includes some **text noise** such as emojis, hashtags, and informal language.  
- `identity_gender` and `identity_city` are mostly unpopulated: roughly 70–80% of records carry the value 0 (Unknown).  
- Engagement counts show **high variance**, with a few viral posts having very large values.  
  → Outlier handling and log-scaling will be required in the preprocessing step.  
- Non-ASCII Vietnamese characters are retained for linguistic analysis.
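
The checks and transformations mentioned above could look roughly like the following sketch. The function names (`unknown_share`, `log_scale`, `cap_upper`) are hypothetical, the capping rule (nearest-rank quantile) is one possible choice of outlier handling rather than the method fixed by this project, and the sample counts are made up.

```python
import math


def unknown_share(codes, unknown_code=0):
    """Fraction of records whose code equals the 'Unknown' value (0 in this dataset)."""
    codes = list(codes)
    if not codes:
        return 0.0
    return sum(1 for c in codes if c == unknown_code) / len(codes)


def log_scale(counts):
    """Apply log(1 + x) so that zero counts stay at zero and large values are compressed."""
    return [math.log1p(max(c, 0)) for c in counts]


def cap_upper(counts, quantile=0.99):
    """Cap values at the nearest-rank quantile (an assumed outlier rule)."""
    ordered = sorted(counts)
    if not ordered:
        return []
    rank = max(1, math.ceil(quantile * len(ordered)))
    cap = ordered[rank - 1]
    return [min(c, cap) for c in counts]


# Made-up engagement counts with one viral outlier.
engagement = [3, 10, 28, 254, 120000]
capped = cap_upper(engagement, quantile=0.75)
scaled = log_scale(engagement)
gender_unknown = unknown_share([0, 0, 1, 2, 0])
print(capped, gender_unknown)
```

`log1p` is used instead of a plain logarithm because many posts are likely to have zero shares or comments, where `log(0)` is undefined.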

---

### 1.6 Planned Visual Exploration

The following charts will be used to describe dataset characteristics:

| Visualization | Purpose |
|----------------|----------|
| **Platform distribution** | Show share of posts by platform (Facebook, TikTok, YouTube). |
| **Source type distribution** | Compare content origin: user vs fanpage vs group. |
| **Post volume over time** | Visualize daily activity and detect viral peaks. |
| **Engagement histogram** | Display skewness in interaction counts. |
| **Top sources by engagement** | Identify most influential accounts or pages. |
| **Gender/City breakdown (if available)** | Explore simple demographic distribution. |

These exploratory visuals help establish an understanding of the dataset’s coverage  
before progressing to **data preparation, sentiment classification, and engagement modeling** in later sections.
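
The aggregates behind several of these charts can be computed before any plotting library is involved. The sketch below assumes rows shaped like the selected features, with `created_date` kept as a UTC ISO string; the function names are hypothetical and the sample rows are made up.

```python
from collections import Counter, defaultdict


def platform_distribution(rows):
    """Share of mentions per platform code."""
    counts = Counter(r["platform"] for r in rows)
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()}


def daily_volume(rows):
    """Number of mentions per UTC calendar day, sorted by date."""
    # The first 10 characters of an ISO timestamp are the YYYY-MM-DD date.
    counts = Counter(r["created_date"][:10] for r in rows)
    return dict(sorted(counts.items()))


def top_sources(rows, n=3):
    """Sources ranked by summed engagement_total, highest first."""
    totals = defaultdict(int)
    for r in rows:
        totals[r["source_name"]] += r["engagement_total"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Made-up rows, only to show the expected shape.
rows = [
    {"platform": 1, "source_name": "Page A", "engagement_total": 318,
     "created_date": "2025-11-04T07:00:04Z"},
    {"platform": 9, "source_name": "User B", "engagement_total": 50,
     "created_date": "2025-11-04T12:30:00Z"},
    {"platform": 1, "source_name": "Page A", "engagement_total": 20,
     "created_date": "2025-11-05T01:00:00Z"},
    {"platform": 7, "source_name": "Channel C", "engagement_total": 500,
     "created_date": "2025-11-05T09:15:00Z"},
]
print(platform_distribution(rows))
print(daily_volume(rows))
print(top_sources(rows, n=2))
```

Days are grouped in UTC here; grouping by local Vietnamese time (UTC+7) would shift some posts to the next day and may be preferable for reading daily peaks.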
