# Unit 2 Exploring Text Data: Unveiling the Details of SMS Spam Collection

# Introduction and Overview

Welcome back! In this lesson, we'll dig deeper into the SMS Spam Collection dataset. We'll learn how to find out more about it, such as unique value counts and some basic summary statistics. Understanding these details matters when working on Natural Language Processing (NLP) tasks, because they inform the preprocessing and modeling steps.

# Exploring the Dataset

To get more details about the `DataFrame`, such as the data types of the columns and their non-null counts, you can use the `info()` method. It prints a summary of the `DataFrame`, including the index dtype, the columns, their non-null counts, and memory usage.

```python
# Show detailed information about the dataset
# info() prints its report directly and returns None, so no print() is needed
df.info()
```

The output of the above code will be:

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   label    5572 non-null   object
 1   message  5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB
```

This output shows the structure of the `DataFrame`: two columns (`label` and `message`) with 5572 entries each. Both columns have dtype `object`, which here means they hold Python strings. Neither column has null values, since the "Non-Null Count" is 5572 for both `label` and `message`.
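The facts that `info()` prints can also be retrieved as values you can use in code. The following is a minimal sketch on a tiny hand-made DataFrame; `toy_df` and its rows are invented for illustration and are not taken from the real dataset. It also shows why `info()` is called directly: the method prints its own report and returns `None`.

```python
import pandas as pd

# Hypothetical miniature stand-in for the SMS dataset (rows invented for illustration)
toy_df = pd.DataFrame({
    "label": ["ham", "spam", "ham"],
    "message": ["See you at 5", "WIN a free prize now", "See you at 5"],
})

# info() prints its report itself and returns None
result = toy_df.info()

# The same facts as values: row/column counts and missing values per column
n_rows, n_cols = toy_df.shape
null_counts = toy_df.isnull().sum()

print("info() returned:", result)
print("Shape:", (n_rows, n_cols))
print("Nulls per column:", null_counts.to_dict())
```

Having the null counts as a Series makes it easy to act on them later, for example to decide whether rows need to be dropped before preprocessing.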

# Identifying Column Names

An essential preliminary step in data exploration is identifying the names of the columns in the DataFrame. Knowing the column names aids in efficiently accessing and manipulating data. Use the `columns` attribute to list all column names in the DataFrame:

```python
# List all column names
print(df.columns)
```

This simple line of code will output the names of the columns in your dataset, making it easier for you to reference specific data points as you continue your analysis:

```
Index(['label', 'message'], dtype='object')
```

Understanding the column names in your dataset is crucial for applying specific data manipulation and analysis techniques effectively.
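If a plain Python list is more convenient than an `Index` object, the column names can be converted with `tolist()`. The sketch below assumes a small invented DataFrame (`toy_df` is hypothetical) with the same two column names as the lesson's dataset.

```python
import pandas as pd

# Hypothetical two-row DataFrame with the same column names as the lesson's dataset
toy_df = pd.DataFrame({
    "label": ["ham", "spam"],
    "message": ["Ok, on my way", "Claim your free ringtone"],
})

# Convert the Index of column names to a regular Python list
column_names = toy_df.columns.tolist()
print(column_names)

# shape gives (number of rows, number of columns)
n_rows, n_cols = toy_df.shape
print(f"{n_rows} rows x {n_cols} columns")

# Use a column name to select a single column as a Series
labels = toy_df["label"]
print(labels.tolist())
```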

# Understanding Unique Values

Now that we have a basic understanding of the structure of the data, let's learn more about its content. We can use the `nunique()` method to count the number of unique messages in the `message` column and the `unique()` method to list the distinct labels in the `label` column.

```python
# Count the number of unique messages
print("Unique messages:", df['message'].nunique())
# Returns the unique labels
print("Labels:", df['label'].unique())
```

The output of the above code will be:

```
Unique messages: 5169
Labels: ['ham' 'spam']
```

This output indicates that there are 5169 unique messages in the dataset, and that the `label` column contains two unique values, 'ham' and 'spam', which represent non-spam and spam messages, respectively. Because the dataset has 5572 rows, some messages must appear more than once. This information helps in understanding the diversity and distribution of the dataset.
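To go one step beyond counting unique values, `value_counts()` reports how often each label occurs and `duplicated()` flags repeated messages. This is a sketch on invented rows (`toy_df` is hypothetical), so the numbers it prints say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical five-row sample (rows invented for illustration)
toy_df = pd.DataFrame({
    "label": ["ham", "ham", "spam", "ham", "spam"],
    "message": [
        "Sorry, I'll call later",
        "Are we still on for lunch?",
        "You have won a prize, reply now",
        "Sorry, I'll call later",
        "You have won a prize, reply now",
    ],
})

# How many rows carry each label, as counts and as fractions of the total
label_counts = toy_df["label"].value_counts()
label_share = toy_df["label"].value_counts(normalize=True)

# Unique messages, and rows that repeat an earlier message
n_unique = toy_df["message"].nunique()
n_repeats = toy_df["message"].duplicated().sum()

print(label_counts)
print(label_share)
print("Unique messages:", n_unique)
print("Repeated rows:", n_repeats)
```

When a column has no missing values, the number of unique values plus the number of repeated rows equals the total row count, which is a quick consistency check.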

# Descriptive Statistics

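Finally, let's get some basic statistics about the data. Pandas provides the `describe()` method, which by default summarizes numerical columns. Our DataFrame has no numerical columns, so `describe()` instead reports `count`, `unique`, `top` (the most frequent value), and `freq` (how often that value occurs) for each text column.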

```python
# Display basic statistics
print(df.describe())
```

The output of the above code will be:

```
       label                 message
count   5572                    5572
unique     2                    5169
top      ham  Sorry, I'll call later
freq    4825                      30
```

This output summarizes both columns. The `label` column has 5572 entries and 2 unique values; the most frequent is 'ham', with 4825 occurrences. The `message` column has 5572 entries, of which 5169 are unique. The most common message is "Sorry, I'll call later", which appears 30 times. This summary shows that some messages repeat and that 'ham' messages far outnumber 'spam' messages.
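The difference between the text summary and the numerical summary becomes visible once a numeric column exists. The sketch below uses invented rows and adds a `message_length` column; both `toy_df` and `message_length` are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Hypothetical four-row sample (rows invented for illustration)
toy_df = pd.DataFrame({
    "label": ["ham", "ham", "spam", "ham"],
    "message": [
        "Sorry, I'll call later",
        "Ok",
        "Free entry! Text WIN now",
        "Sorry, I'll call later",
    ],
})

# With only text columns, describe() reports count, unique, top, and freq
text_summary = toy_df.describe()
print(text_summary)

# Add a numeric column: the number of characters in each message
toy_df["message_length"] = toy_df["message"].str.len()

# With a numeric column present, the default describe() covers only that column
numeric_only = toy_df.describe()
print(numeric_only)

# A single numeric column can also be summarized on its own
numeric_summary = toy_df["message_length"].describe()
print(numeric_summary)
```

Message length is a common first numeric feature for text data, which is why it serves as the example here.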

# Lesson Summary and Practice

In this lesson, we've taken a deeper look at our dataset using Python and pandas. We've learned how to use the pandas `info()`, `nunique()`, `unique()`, and `describe()` methods, along with the `columns` attribute, to get more information about our dataset. Understanding the composition of the dataset is an important step when working on NLP tasks. In the exercises that follow, we'll practice these methods to reinforce what we've learned. Keep up the great work!

## Exploring SMS Spam Dataset Insights

Dive hands-on into the SMS Spam Collection dataset using key pandas functions for data exploration. Press Run without altering the provided code to see these functions in action and enhance your understanding of dataset analysis.


```python
import pandas as pd
from datasets import load_dataset

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')

# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Show detailed information about the dataset including datatype and non-null counts
df.info()
print()

# Display a list of all columns in the DataFrame
print(df.columns, "\n")

# Count the number of unique messages
print("Unique messages:", df['message'].nunique(), "\n")

# Returns the unique labels
print("Labels:", df['label'].unique(), "\n")

# Display basic statistics
print(df.describe())
```


The code provided explores the SMS Spam Collection dataset using pandas functions. Executing the code will output various insights into the dataset's structure and content.

The `df.info()` output will show that the DataFrame has two columns, 'label' and 'message', each with 5572 non-null entries. Both columns are of `object` dtype, which here means they hold string values.

The `df.columns` output will display the column names, which are 'label' and 'message'.

The `df['message'].nunique()` output will show that there are 5169 unique messages in the dataset. The `df['label'].unique()` output will reveal the two unique labels: 'ham' and 'spam'.

Finally, `df.describe()` will provide basic statistics. For the 'message' column, it will show a count of 5572, 5169 unique messages, and the most frequent message as "Sorry, I'll call later", appearing 30 times.


## Fix the Column Reference Error

Dive into the SMS Spam Collection dataset with pandas to develop your troubleshooting skills. Can you find the error while working on tasks covering dataset insights? This exercise emphasizes the necessity of accurate column referencing in data exploration.


```python
import pandas as pd
from datasets import load_dataset

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')

# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Count the number of unique messages and labels
print("Unique messages:", df['sms'].nunique(), "\n")

# Returns the unique labels
print("Labels:", df['label'].unique(), "\n")


```

To fix the `KeyError` in the provided code, you need to change the column reference from `'sms'` to `'message'`.

The `df.info()` and `df.columns` output in the previous exercise showed that the dataset contains a column named 'message', not 'sms'.

Here's the corrected code:

```python
import pandas as pd
from datasets import load_dataset

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')

# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Count the number of unique messages
# Corrected: changed 'sms' to 'message'
print("Unique messages:", df['message'].nunique(), "\n")

# Returns the unique labels
print("Labels:", df['label'].unique(), "\n")
```
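To see what this kind of mistake looks like at runtime, the sketch below triggers the same `KeyError` on a small invented DataFrame and reports the available column names. `toy_df` and `wanted` are hypothetical names used only for this illustration.

```python
import pandas as pd

# Hypothetical two-row DataFrame with the same column names as the lesson's dataset
toy_df = pd.DataFrame({
    "label": ["ham", "spam"],
    "message": ["Running late", "Urgent! Call this number"],
})

wanted = "sms"  # the incorrect column name from the exercise

try:
    unique_count = toy_df[wanted].nunique()
except KeyError:
    # Selecting a column that does not exist raises KeyError
    print(f"No column named {wanted!r}. Available columns: {toy_df.columns.tolist()}")
    # Fall back to the column that actually exists
    unique_count = toy_df["message"].nunique()

print("Unique messages:", unique_count)
```

Printing the available columns in the error path is usually enough to spot a misnamed column right away.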

## Extracting Information and Columns from Dataset

In this exercise, your aim is to fill in the missing code to extract detailed information and the column names from our dataset. This will reinforce your ability to explore and understand datasets, a skill critical for data analysis.

```python
import pandas as pd
from datasets import load_dataset

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')

# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# TODO: Implement the info method to display detailed information about the dataset

# TODO: Display a list of all columns in the DataFrame

```

To complete the task, you need to use the `.info()` method on the DataFrame to display detailed information about it, and the `.columns` attribute to display a list of all columns.

Here's the completed code:

```python
import pandas as pd
from datasets import load_dataset

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')

# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Display detailed information about the dataset
print("Detailed DataFrame Information:")
df.info()

# Display a list of all columns in the DataFrame
print("\nList of all columns:")
print(df.columns)
```

## Unlocking Insights: SMS Spam Data Exploration

Building on our progress, this task focuses on reinforcing your ability to inspect specific aspects of our dataset. Here, a crucial piece of code is missing. Your objective is to accurately fill in the blank to display basic statistics of our dataset.

```python
import pandas as pd
from datasets import load_dataset

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')

# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# TODO: Implement the describe method to display basic statistics

```

Here's the completed code:

```python
import pandas as pd
from datasets import load_dataset

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')

# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Display basic statistics with the describe method
print("Basic statistics of the DataFrame:")
print(df.describe())
```

## Mastering Pandas with SMS Spam Data

Over the previous exercises, you've become acquainted with pandas functions that reveal the distinct characteristics of data. It's time to bring them together. Your challenge is to explore the dataset's structure and extract its key descriptive statistics. This solidifies your understanding of the dataset's anatomy, which is vital groundwork for data science projects.

```python
import pandas as pd
from datasets import load_dataset

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')

# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# TODO: Use the appropriate function to display the DataFrame's detailed information

# TODO: Print all columns present in the DataFrame

# TODO: Count and print the number of unique messages in the dataset

# TODO: Identify and print the unique labels found in the dataset

# TODO: Display basic descriptive statistics of the DataFrame

```

Here's the completed code:

```python
import pandas as pd
from datasets import load_dataset

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')

# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Display the DataFrame's detailed information
print("DataFrame Detailed Information:")
df.info()
print("\n" + "="*50 + "\n")

# Print all columns present in the DataFrame
print("Columns in the DataFrame:")
print(df.columns)
print("\n" + "="*50 + "\n")

# Count and print the number of unique messages in the dataset
print("Number of unique messages:")
print(df['message'].nunique())
print("\n" + "="*50 + "\n")

# Identify and print the unique labels found in the dataset
print("Unique labels in the dataset:")
print(df['label'].unique())
print("\n" + "="*50 + "\n")

# Display basic descriptive statistics of the DataFrame
print("Basic descriptive statistics of the DataFrame:")
print(df.describe())
print("\n" + "="*50 + "\n")
```