<a href="https://colab.research.google.com/github/tfindiamooc/mlp/blob/main/TextAnalysisClass5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Lesson #5: Reading Text Data from CSV File (Quickstart)

Welcome back! In this part of the lesson, we'll demonstrate how to read text data from a CSV file.

This approach simulates a more realistic scenario where your text data is stored in an external file. You will learn how to read data from a local file, which is a common task in machine learning projects.



Let's start by generating and saving our CSV data to a local file, and then reading it back in!

In [2]:
# @title Review Generation Code
import pandas as pd

# 1. Sample data lists (text and labels)
text_data_list = [
    "This movie was absolutely fantastic! The acting was superb and the plot kept me hooked.",
    "I wasted my time and money. The plot was nonsensical and the acting was atrocious.",
    "The book was okay, nothing special.  The characters were a bit bland.",
    "I loved this product! It exceeded my expectations. Highly recommend it.",
    "This is the worst service I have ever experienced.  Avoid at all costs!",
    "A decent read, quite enjoyable in parts.  The writing style was good.",
]

label_data_list = [
    "positive",
    "negative",
    "neutral",
    "positive",
    "negative",
    "neutral",
]

# 2. Create a pandas DataFrame from the lists
data_dict = {'text': text_data_list, 'label': label_data_list}
data = pd.DataFrame(data_dict)

# 3. Define the CSV file name
csv_file_name = 'text_data.csv'

# 4. Save the DataFrame to a local CSV file
data.to_csv(csv_file_name, index=False) # index=False prevents writing DataFrame index to CSV

print(f"CSV file '{csv_file_name}' created successfully in the local environment.")

CSV file 'text_data.csv' created successfully in the local environment.


1.  **Sample CSV Data (Generated Programmatically)**:
    *   Instead of a string, we now create the data programmatically using Python lists: `text_data_list` and `label_data_list`. These lists hold the same example text documents and labels as before.
    *   We then create a pandas DataFrame `data` directly from these lists using a dictionary.

2.  **`data.to_csv(csv_file_name, index=False)`**:
    *   This line **saves the DataFrame `data` to a local CSV file**.
    *   `csv_file_name = 'text_data.csv'`: We define the name of the CSV file as `text_data.csv`. This file will be created in the current Colab environment's file storage (you can see it in the file pane on the left in Colab).
    *   `index=False`: This argument prevents pandas from writing the DataFrame index as an extra column in the CSV file. The default index is just the row numbers 0 to 5, which pandas recreates automatically when the file is read back, so there is no need to store it.
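
To see what `index=False` changes, the following minimal sketch writes a tiny, made-up DataFrame to CSV text both ways and reads each version back. The names `demo`, `csv_with_index`, and `csv_without_index` are illustrative and not part of the lesson code, and `io.StringIO` stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical two-row sample, used only for this illustration
demo = pd.DataFrame({
    "text": ["Great film.", "Terrible service."],
    "label": ["positive", "negative"],
})

# With no path argument, to_csv() returns the CSV content as a string
csv_with_index = demo.to_csv()                # index written as an unnamed first column
csv_without_index = demo.to_csv(index=False)  # only the 'text' and 'label' columns

cols_with_index = list(pd.read_csv(io.StringIO(csv_with_index)).columns)
cols_without_index = list(pd.read_csv(io.StringIO(csv_without_index)).columns)

print(cols_with_index)     # ['Unnamed: 0', 'text', 'label']
print(cols_without_index)  # ['text', 'label']
```

When the index is written, `read_csv` loads it back as an ordinary column named `Unnamed: 0`, which is usually not wanted in a text classification dataset.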

In [3]:
# 1. Read the CSV file back into a pandas DataFrame
data_from_csv = pd.read_csv(csv_file_name)

# 2. Display the first few rows of the DataFrame read from CSV
print("\nFirst few rows of the DataFrame read from CSV file:")
print(data_from_csv.head())

# 3. Display information about the DataFrame (data types, non-null values)
print("\nDataFrame Information (from CSV file):")
data_from_csv.info()  # info() prints its report directly and returns None, so no print() wrapper


First few rows of the DataFrame read from CSV file:
                                                text     label
0  This movie was absolutely fantastic! The actin...  positive
1  I wasted my time and money. The plot was nonse...  negative
2  The book was okay, nothing special.  The chara...   neutral
3  I loved this product! It exceeded my expectati...  positive
4  This is the worst service I have ever experien...  negative

DataFrame Information (from CSV file):
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    6 non-null      object
 1   label   6 non-null      object
dtypes: object(2)
memory usage: 228.0+ bytes


### Understanding the Code and Loaded Data

In the code above, we demonstrated how to read text data from a CSV file into a pandas DataFrame. Let's break down the steps:

1.  **`pd.read_csv(csv_file_name)`**:
    *   This line **reads the CSV file we just created** (`text_data.csv`) back into a new pandas DataFrame called `data_from_csv`.
    *   Now, `pd.read_csv()` is reading from an actual file in the local file system, simulating reading from an external data source.

2.  **`data_from_csv.head()`**:
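    *   This displays the first five rows (the default for `head()`) of the DataFrame that we read from the CSV file (`data_from_csv`). You should see the same `text` and `label` columns as before, confirming that the data has been saved to and read from the CSV file correctly. The `...` at the end of each text is only display truncation; the full strings are stored in the DataFrame.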

3.  **`data_from_csv.info()`**:
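    *   This prints a summary of the DataFrame read from the CSV file (column names, non-null counts, and data types), as explained in Part 1. Here both columns have 6 non-null values, so nothing was lost in the round trip. `info()` prints the summary itself and returns `None`, so it does not need to be wrapped in `print()`.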
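
The save-and-reload steps above can also be checked programmatically rather than by eye. The sketch below is one possible way to do that, assuming a small made-up DataFrame and a temporary directory instead of the lesson's `text_data.csv`; the names `original`, `restored`, and `round_trip.csv` are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample; note that the first text contains a comma
original = pd.DataFrame({
    "text": ["The book was okay, nothing special.", "I loved this product!"],
    "label": ["neutral", "positive"],
})

# Write to a temporary folder so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "round_trip.csv")
    original.to_csv(csv_path, index=False)
    restored = pd.read_csv(csv_path)

# equals() compares shape, column names, dtypes, and every value
round_trip_ok = original.equals(restored)

# Count missing values per column, the same information info() reports as non-null counts
missing_per_column = restored.isna().sum().to_dict()

print(round_trip_ok)
print(missing_per_column)
```

Because pandas quotes any field that contains a comma when writing, text with commas comes back as a single value in the `text` column instead of being split across columns.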

**Key Points for Text Classification Data in CSV:**

*   **Text Column:**  Identify the column in your DataFrame that contains the text documents you want to classify. In our example, it's the `text` column.
*   **Label Column (Target):** Identify the column that contains the class labels or categories for each text document. In our example, it's the `label` column.  These labels are what your classification model will learn to predict.
*   **DataFrame Structure:** You will typically work with a pandas DataFrame where each row represents one text document and the columns hold the text, its label, and any additional features.
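
As a rough illustration of these points, the sketch below separates a DataFrame into model inputs and targets and counts how many documents each class has. The sample rows and the variable names `texts`, `labels`, and `label_counts` are assumptions made for this example, not part of the lesson code.

```python
import pandas as pd

# Hypothetical stand-in for a DataFrame loaded with pd.read_csv()
reviews = pd.DataFrame({
    "text": [
        "Absolutely fantastic movie.",
        "I wasted my time and money.",
        "The book was okay.",
        "I loved this product!",
        "Worst service ever.",
        "A decent read.",
    ],
    "label": ["positive", "negative", "neutral",
              "positive", "negative", "neutral"],
})

# Text column: the documents a classifier would receive as input
texts = reviews["text"].tolist()

# Label column: the target the classifier would learn to predict
labels = reviews["label"].tolist()

# Class distribution, useful for spotting imbalanced labels early
label_counts = reviews["label"].value_counts().to_dict()

print(len(texts), len(labels))
print(label_counts)
```

Keeping the text and label columns in the same DataFrame until this point ensures that each document stays aligned with its label.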

By running this code, you have not only read text data but also generated a CSV file, saved it locally, and read it back, which is closer to the workflow you will meet in data science projects. You are now better prepared to handle real-world text data for classification!