In [17]:
# --- 1.0 Data Ingestion and Initial Exploration ---

# ### 1. Setup and Imports
# This notebook takes a first exploratory pass over the raw data files
# to understand their structure and identify any initial issues.

import sys
from pathlib import Path
import pandas as pd
from loguru import logger

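# Find the project's root by searching upward for a known file (config.py).
# Path.parents excludes the path itself, so the current directory is checked first;
# otherwise the search fails when the notebook is run from the repository root.
current_path = Path().resolve()
project_root = None
for parent in [current_path, *current_path.parents]:
    if (parent / 'emoji_sentiment_analysis' / 'config.py').exists():
        project_root = parent
        break
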
if project_root is None:
    raise FileNotFoundError("Could not find the project root. Make sure you are inside the repository.")

# Add the project's root directory to the Python path
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

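# With the project root on sys.path, import config through its package.
# This avoids also adding the package directory to sys.path, where a bare
# `import config` could collide with any other module named `config`.
from emoji_sentiment_analysis import config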

# Access variables via the imported module
RAW_DATA_DIR = config.RAW_DATA_DIR
TEXT_COL = config.TEXT_COL
TARGET_COL = config.TARGET_COL

# ### 2. A Reusable Function for Data Inspection
# Let's create a function to avoid repeating the same code for each dataset.

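def inspect_dataset(file_path: Path) -> "pd.DataFrame | None":
    """
    Loads a CSV file, prints its head and info, and returns the DataFrame.
    Returns None (after logging the error) if the file is missing or cannot be read.
    """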
    logger.info(f"Loading and inspecting {file_path.name}...")
    try:
        df = pd.read_csv(file_path)
        
        # Display the first few rows to see the content and column names
        print(f"\n--- First 5 rows of {file_path.name} ---")
        print(df.head())
        
        # Display information about the DataFrame (columns, non-null counts, data types)
        print(f"\n--- DataFrame Info for {file_path.name} ---")
        df.info()
        
        return df
    
    except FileNotFoundError:
        logger.error(f"File not found: {file_path}")
        return None
    except Exception as e:
        logger.error(f"An error occurred while loading {file_path.name}: {e}")
        return None

# ### 3. Load and Inspect All Raw Datasets
# Now, let's use the function to load and inspect both datasets.

df1 = inspect_dataset(RAW_DATA_DIR / "1k_data_emoji_tweets_senti_posneg.csv")
print("-" * 50)
df2 = inspect_dataset(RAW_DATA_DIR / "15_emoticon_data.csv")

# ### 4. Summary of Key Findings
# Based on the initial exploration, here are the observations and planned next steps.

# **Dataset 1: `1k_data_emoji_tweets_senti_posneg.csv`**
# - Has a redundant `Unnamed: 0` column that merely duplicates the row index and needs to be removed.
# - The text and target columns are named `post` and `sentiment`, respectively, and should be renamed to the standardized `text` and `label`.
# - Some posts contain emoji characters (e.g. 😧, 😊) that need explicit handling rather than being treated as ordinary text.

# **Dataset 2: `15_emoticon_data.csv`**
# - Also contains an `Unnamed: 0` column to be removed.
# - This dataset is not for direct combination; it serves as a lookup table to identify and transform emojis in the primary dataset.
# - The emoji characters can be used to inform a feature engineering step on the primary dataset.

# **Next Steps:**
# Based on these findings, a data processing script (`dataset.py`) is required to:
# 1. Load the raw datasets.
# 2. Use `15_emoticon_data.csv` as a lookup table to replace emojis with a placeholder.
# 3. Clean and standardize the primary dataset (`1k_data_emoji_tweets_senti_posneg.csv`), including dropping the redundant `Unnamed: 0` column and renaming columns.
# 4. Save the final, clean dataset to the `data/processed` folder.
#
# This data exploration justifies the need for the `dataset.py` script and its specific functionalities.
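#
# As a rough illustration of steps 2 and 3 above, the sketch below shows one way the
# cleaning could look. It is not the actual `dataset.py`: the function names
# `standardize_primary` and `replace_emojis`, the `<EMOJI>` placeholder, and the idea of
# passing the emojis as a plain list are all assumptions made for this example, since the
# column layout of `15_emoticon_data.csv` is not shown here.

```python
import pandas as pd


def standardize_primary(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the leftover index column and rename columns to text/label."""
    # errors="ignore" keeps this safe if the column was already removed.
    cleaned = df.drop(columns=["Unnamed: 0"], errors="ignore")
    return cleaned.rename(columns={"post": "text", "sentiment": "label"})


def replace_emojis(text: str, emojis: list[str], placeholder: str = "<EMOJI>") -> str:
    """Replace each listed emoji with a placeholder token (hypothetical helper)."""
    for emoji in emojis:
        # Pad with spaces so the placeholder stays a separate token.
        text = text.replace(emoji, f" {placeholder} ")
    # Collapse the extra whitespace introduced by the padding.
    return " ".join(text.split())


# Tiny in-memory sample mirroring the columns seen in the primary dataset.
sample = pd.DataFrame(
    {
        "Unnamed: 0": [0, 1],
        "sentiment": [0, 1],
        "post": ["its ending soon aah unhappy 😧", "My real time happy 😊"],
    }
)

cleaned = standardize_primary(sample)
# In a real run, the emoji list would come from the lookup table instead.
cleaned["text"] = cleaned["text"].apply(replace_emojis, emojis=["😧", "😊"])
print(cleaned)
```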

2025-09-22 14:07:37.557 | INFO     | __main__:inspect_dataset:43 - Loading and inspecting 1k_data_emoji_tweets_senti_posneg.csv...
2025-09-22 14:07:37.587 | INFO     | __main__:inspect_dataset:43 - Loading and inspecting 15_emoticon_data.csv...



--- First 5 rows of 1k_data_emoji_tweets_senti_posneg.csv ---
   Unnamed: 0  sentiment                                               post
0           0          1                             Good morning every one
1           1          0  TW: S AssaultActually horrified how many frien...
2           2          1  Thanks by has notice of me Greetings : Jossett...
3           3          0                      its ending soon aah unhappy 😧
4           4          1                               My real time happy 😊

--- DataFrame Info for 1k_data_emoji_tweets_senti_posneg.csv ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  1000 non-null   int64 
 1   sentiment   1000 non-null   int64 
 2   post        1000 non-null   object
dtypes: int64(2), object(1)
memory usage: 23.6+ KB
--------------------------------------------------

--- Fir