# 1.0 Data Ingestion and Initial Exploration 

## 📑 Notebook Overview

This notebook performs an **initial exploration of raw emoji sentiment datasets** to guide data cleaning and preprocessing.  

**Objectives:**
- Ingest and inspect raw data files.  
- Identify structural issues (column naming, duplicates, emoji handling).  
- Document findings to inform a data processing script (`dataset.py`).  


### Section 1: Setup and Imports
This cell locates the project root, makes the `emoji_sentiment_analysis` package importable,
and reads the shared settings (`RAW_DATA_DIR`, `TEXT_COL`, `TARGET_COL`) from `config.py`.


In [None]:
# Code for Setup and Imports

import sys
from pathlib import Path
import pandas as pd
from loguru import logger

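# Find the project's root by searching for a known file (like config.py).
# Check the current directory first, then each of its parents, so the
# search also works when the notebook is run from the repository root.
current_path = Path().resolve()
project_root = None
for candidate in [current_path, *current_path.parents]:
    if (candidate / 'emoji_sentiment_analysis' / 'config.py').exists():
        project_root = candidate
        break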
        
if project_root is None:
    raise FileNotFoundError("Could not find the project root. Make sure you are inside the repository.")

# Add the project's root directory to the Python path
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

# Make the package directory importable, then import its config module
sys.path.append(str(project_root / 'emoji_sentiment_analysis'))
import config

# Access variables via the imported module
RAW_DATA_DIR = config.RAW_DATA_DIR
TEXT_COL = config.TEXT_COL
TARGET_COL = config.TARGET_COL



### Section 2: A Reusable Function for Data Inspection
Let's create a function to avoid repeating the same code for each dataset.



In [None]:
# Code for the Function

def inspect_dataset(file_path: Path):
    """
    Load a CSV file, print its head and info, and return the DataFrame (None on failure).
    """
    logger.info(f"Loading and inspecting {file_path.name}...")
    try:
        df = pd.read_csv(file_path)
        
        # Display the first few rows to see the content and column names
        print(f"\n--- First 5 rows of {file_path.name} ---")
        print(df.head())
        
        # Display information about the DataFrame (columns, non-null counts, data types)
        print(f"\n--- DataFrame Info for {file_path.name} ---")
        df.info()
        
        return df
    
    except FileNotFoundError:
        logger.error(f"File not found: {file_path}")
        return None
    except Exception as e:
        logger.error(f"An error occurred while loading {file_path.name}: {e}")
        return None



### Section 3: Load and Inspect All Raw Datasets
Now, let's use the function to load and inspect both datasets.



In [None]:
# Code to Inspect the First Dataset

df1 = inspect_dataset(RAW_DATA_DIR / "1k_data_emoji_tweets_senti_posneg.csv")



2025-09-22 15:25:07.770 | INFO     | __main__:inspect_dataset:6 - Loading and inspecting 1k_data_emoji_tweets_senti_posneg.csv...



--- First 5 rows of 1k_data_emoji_tweets_senti_posneg.csv ---
   Unnamed: 0  sentiment                                               post
0           0          1                             Good morning every one
1           1          0  TW: S AssaultActually horrified how many frien...
2           2          1  Thanks by has notice of me Greetings : Jossett...
3           3          0                      its ending soon aah unhappy 😧
4           4          1                               My real time happy 😊

--- DataFrame Info for 1k_data_emoji_tweets_senti_posneg.csv ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  1000 non-null   int64 
 1   sentiment   1000 non-null   int64 
 2   post        1000 non-null   object
dtypes: int64(2), object(1)
memory usage: 23.6+ KB
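
The objectives mention duplicates, which `df.info()` does not report. A minimal sketch of a follow-up check is shown below. The helper name `summarize_quality` and the small `sample` frame are assumptions made for illustration; in the notebook the function would be called on `df1` instead.

```python
import pandas as pd

# Tiny stand-in for df1, built from rows shown in the preview above
# (the first row is repeated on purpose to create one duplicate).
sample = pd.DataFrame(
    {
        "Unnamed: 0": [0, 1, 2, 3],
        "sentiment": [1, 0, 1, 1],
        "post": [
            "Good morning every one",
            "its ending soon aah unhappy 😧",
            "Good morning every one",
            "My real time happy 😊",
        ],
    }
)


def summarize_quality(df, text_col="post", label_col="sentiment"):
    """Return duplicate-text count, missing-value count, and class counts."""
    return {
        "duplicate_texts": int(df.duplicated(subset=[text_col]).sum()),
        "missing_values": int(df.isna().sum().sum()),
        "class_counts": df[label_col].value_counts().to_dict(),
    }


summary = summarize_quality(sample)
print(summary)
```

Counting duplicates on the text column alone, rather than on whole rows, also catches repeated posts that carry different index values or labels.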


In [None]:
# Code to Inspect the Second Dataset

print("-" * 50)
df2 = inspect_dataset(RAW_DATA_DIR / "15_emoticon_data.csv")



2025-09-22 15:25:08.133 | INFO     | __main__:inspect_dataset:6 - Loading and inspecting 15_emoticon_data.csv...


--------------------------------------------------

--- First 5 rows of 15_emoticon_data.csv ---
   Unnamed: 0 Emoji Unicode codepoint                         Unicode name
0           0     😍           0x1f60d  SMILING FACE WITH HEART-SHAPED EYES
1           1     😭           0x1f62d                   LOUDLY CRYING FACE
2           2     😘           0x1f618                 FACE THROWING A KISS
3           3     😊           0x1f60a       SMILING FACE WITH SMILING EYES
4           4     😁           0x1f601      GRINNING FACE WITH SMILING EYES

--- DataFrame Info for 15_emoticon_data.csv ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Unnamed: 0         16 non-null     int64 
 1   Emoji              16 non-null     object
 2   Unicode codepoint  16 non-null     object
 3   Unicode name       16 non-null     object
dtypes: int64(1), object(3)

### 📊 Key Findings and Next Steps

**Dataset 1: `1k_data_emoji_tweets_senti_posneg.csv`**
- Contains a redundant `Unnamed: 0` column.
- Columns `post` and `sentiment` need renaming to standardized `text` and `label`.
- Contains emojis that require preprocessing.

**Dataset 2: `15_emoticon_data.csv`**
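- Also has a redundant `Unnamed: 0` column.
- Holds 16 rows even though the filename suggests 15, so it is worth checking for a duplicate or stray entry.
- Functions as a lookup table for emoji transformations, not for direct modeling.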
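
One way the lookup table could drive an emoji transformation is sketched below. This is an assumption about the eventual approach, not the project's actual implementation: the token format (lowercased Unicode name with underscores) and the helper `replace_emojis` are hypothetical, and only two rows from the preview are used.

```python
import pandas as pd

# Two rows copied from the lookup table preview above.
lookup = pd.DataFrame(
    {
        "Emoji": ["😊", "😭"],
        "Unicode name": ["SMILING FACE WITH SMILING EYES", "LOUDLY CRYING FACE"],
    }
)

# Map each emoji character to a text token (hypothetical token format).
emoji_to_token = {
    emoji: name.lower().replace(" ", "_")
    for emoji, name in zip(lookup["Emoji"], lookup["Unicode name"])
}


def replace_emojis(text, mapping):
    """Replace every known emoji in text with its token."""
    for emoji, token in mapping.items():
        # Pad with spaces so the token never fuses with a neighboring word.
        text = text.replace(emoji, f" {token} ")
    # Collapse the extra whitespace introduced by the padding.
    return " ".join(text.split())


print(replace_emojis("My real time happy 😊", emoji_to_token))
# prints: My real time happy smiling_face_with_smiling_eyes
```

Emojis missing from the lookup table pass through unchanged, so the preprocessing script would still need a policy for them.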

**Next Steps:**
1. Build `dataset.py` to handle cleaning, column renaming, and emoji replacement.  
2. Save the processed dataset to `data/processed/`.  
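
The cleaning and saving steps might look roughly like the sketch below. It is not the contents of `dataset.py`: the function name `clean_tweets`, the `RENAME_MAP` constant, and the output filename are assumptions, and the sample frame reuses two rows from the preview of the first dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Assumed target names, taken from the findings above.
RENAME_MAP = {"post": "text", "sentiment": "label"}


def clean_tweets(df):
    """Drop the leftover index column and apply the standardized column names."""
    cleaned = df.drop(columns=["Unnamed: 0"], errors="ignore")
    return cleaned.rename(columns=RENAME_MAP)


# Two rows copied from the preview of the first dataset.
raw = pd.DataFrame(
    {
        "Unnamed: 0": [3, 4],
        "sentiment": [0, 1],
        "post": ["its ending soon aah unhappy 😧", "My real time happy 😊"],
    }
)
processed = clean_tweets(raw)

# A temporary directory keeps the sketch free of side effects;
# the real script would write under data/processed/ instead.
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "tweets_clean.csv"  # hypothetical filename
    # index=False stops pandas from writing the index, which would
    # come back as another "Unnamed: 0" column on the next read.
    processed.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(list(reloaded.columns))
```

Writing with `index=False` is the design choice that matters here, since an `Unnamed: 0` column typically appears when a DataFrame index is saved to CSV and then read back without `index_col`.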
