# Unit 1 Exploring Text Data: Loading and Examining the SMS Spam Collection Dataset

# Lesson Overview

Welcome to the first lesson, where we meet our dataset! You will learn how to load a text dataset in Python, how to convert the loaded dataset into a `pandas` DataFrame, and how to perform some initial exploration with the `pandas` library.

The dataset we will work with in this lesson is the popular SMS Spam Collection dataset, which is widely used for text classification tasks in Natural Language Processing (NLP).

## Loading the Dataset via the Python Library `datasets`

To load the SMS Spam Collection dataset, which is hosted on the CodeSignal platform, we will use the `load_dataset` function from the `datasets` library, as demonstrated in this code snippet:

```python
from datasets import load_dataset

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')
```

After loading the dataset, let's convert it to a `pandas` DataFrame for more convenient handling.

## Converting the Loaded Dataset to a `pandas` DataFrame

A pandas DataFrame is a two-dimensional labeled data structure whose columns can hold different types. It is generally the most commonly used pandas object and is well suited to data wrangling, manipulation, and analysis, with built-in arithmetic operations and aggregations. We'll be converting our `sms_spam` data into a `pandas` DataFrame.

The code snippet to perform this conversion is as follows:

```python
import pandas as pd

# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])
```

The data stored under the `'train'` split of the loaded dataset is converted into a `pandas` DataFrame by passing it to the `pd.DataFrame()` constructor.
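
To see what the constructor does with row-like data, here is a minimal sketch that runs without downloading anything. The `stand_in` dictionary and the name `df_demo` are hypothetical: `stand_in` is a plain Python dict that only mimics the shape of the loaded dataset (a `'train'` entry holding records with `label` and `message` fields), and the two records reuse truncated text from the preview shown later in this lesson.

```python
import pandas as pd

# Hypothetical stand-in for the loaded dataset: a plain dict with a 'train'
# key holding a list of records. The object returned by load_dataset is a
# different type; this only mimics its shape for illustration.
stand_in = {
    'train': [
        {'label': 'ham', 'message': 'Ok lar... Joking wif u oni...'},
        {'label': 'spam', 'message': 'Free entry in 2 a wkly comp to win FA Cup fina...'},
    ]
}

# Each record becomes a row; each record key becomes a column
df_demo = pd.DataFrame(stand_in['train'])

print(df_demo.shape)          # (number of rows, number of columns)
print(list(df_demo.columns))  # column names taken from the record keys
```

For a real split, `Dataset` objects in the `datasets` library also offer a `to_pandas()` method, so `sms_spam['train'].to_pandas()` may be worth trying as an alternative to the constructor.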

## Exploring Initial Entries Using the `pandas` `head()` Method

One of the first steps in working with any dataset is finding out what it contains. The easiest way to get a quick idea of a DataFrame's contents is to call its `head()` method, which shows the first few rows.

The `head()` method returns the first `n` rows, where `n` is passed as an argument. If no argument is passed, it returns the first 5 rows by default.

This is how you can use the `head()` method to preview the initial entries of the DataFrame:

```python
# Preview the first entries of the DataFrame
print(df.head())
```

The output of the above code will be:

```
  label                                            message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
```

This output shows the structure of the DataFrame containing the SMS data. Each row represents a distinct message: the `label` column indicates whether the message is spam (`spam`) or not (`ham`), and the `message` column contains the text of the message. The trailing `...` in each message is pandas shortening long text for display; the full text is still stored in the DataFrame.
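
The effect of the argument to `head()` can be checked on a small stand-in DataFrame. In this sketch, `demo` and its seven rows are hypothetical placeholders invented for illustration, not entries from the SMS Spam Collection dataset.

```python
import pandas as pd

# Hypothetical placeholder rows, not real dataset entries
demo = pd.DataFrame({
    'label': ['ham', 'ham', 'spam', 'ham', 'ham', 'spam', 'ham'],
    'message': [f'placeholder message {i}' for i in range(7)],
})

default_preview = demo.head()    # no argument: first 5 rows
short_preview = demo.head(2)     # first 2 rows
long_preview = demo.head(100)    # n larger than the DataFrame: all 7 rows

print(len(default_preview), len(short_preview), len(long_preview))  # 5 2 7
```

Asking for more rows than exist does not raise an error; `head()` simply returns every row available.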

## Exploring Data with Pandas

Apply your knowledge of loading datasets with Python and pandas by executing the provided code, which loads the SMS Spam Collection dataset into a pandas DataFrame for initial exploration. No changes to the code are needed. Just hit Run to execute it and observe the output to gain insight into the dataset's structure.


```python
import pandas as pd
from datasets import load_dataset

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')

# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Preview the first entries of the DataFrame
print(df.head())
```

## Expanding Data Preview with Python

Building on your journey through data exploration, you've learned to load the SMS Spam Collection dataset and inspect its initial entries using the `head()` method.

Now, let's tweak the way we preview data. This practice asks you to change the number of rows displayed by the `head()` method: instead of the default first five entries, adjust the code to display the first 10. This small yet practical adjustment will familiarize you with customizing data previews, a handy skill for any data exploration task.


```python
import pandas as pd
from datasets import load_dataset

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')

# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# TODO: Change the below code to display the first 10 entries of the DataFrame
print(df.head())
```

To display the first 10 entries of the DataFrame, you need to pass `10` as an argument to the `head()` method.

```python
import pandas as pd
from datasets import load_dataset

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')

# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Display the first 10 entries of the DataFrame
print(df.head(10))
```

## Spot the Data Loading Bug

Identify and fix a mistake in a Python code snippet meant to load the SMS Spam Collection dataset into a pandas DataFrame. This exercise hones your skill in debugging data loading errors. Correct the code so that it properly loads the dataset and displays its first entries.


```python
import pandas as pd
from datasets import load_dataset

# Incorrect: Attempting to load the dataset directly into a pandas DataFrame
sms_spam = pd.DataFrame(load_dataset('codesignal/sms-spam-collection'))

# TODO: Fix the above line to correctly load the dataset and then convert it to a DataFrame

# Preview the first entries of the DataFrame
print(sms_spam.head())
```

The provided code passes the result of `load_dataset(...)` straight to `pd.DataFrame(...)`. However, the `load_dataset` function from the `datasets` library returns a `DatasetDict` object here, which is a container of named splits rather than a table of rows, so it is not directly convertible to the intended pandas DataFrame in this manner. To load and convert the dataset correctly, first store the loaded dataset in a variable, then access a specific split (such as `'train'`) from that object, and finally pass that split to the `pd.DataFrame()` constructor.

Here's the corrected code:

```python
import pandas as pd
from datasets import load_dataset

# Corrected: Load the dataset first, then convert the 'train' split to a pandas DataFrame
sms_spam_dataset = load_dataset('codesignal/sms-spam-collection')
df = pd.DataFrame(sms_spam_dataset['train'])

# Preview the first entries of the DataFrame
print(df.head())
```
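
The reason a split has to be selected first can be illustrated without the `datasets` library. In this sketch, `splits` is a hypothetical plain dict standing in for the object returned by `load_dataset`, and `wrong` and `right` are illustrative names. A real `DatasetDict` may behave differently (for example, by raising an error), so treat this only as an analogy for why the top-level container is the wrong thing to pass.

```python
import pandas as pd

# Hypothetical stand-in: a mapping from split name to a list of records
splits = {
    'train': [
        {'label': 'ham', 'message': 'Ok lar... Joking wif u oni...'},
        {'label': 'spam', 'message': 'Free entry in 2 a wkly comp to win FA Cup fina...'},
    ]
}

# Passing the whole mapping: the split name becomes the only column,
# and each cell holds an entire record
wrong = pd.DataFrame(splits)
print(list(wrong.columns))   # ['train']

# Selecting the split first: the record keys become the columns
right = pd.DataFrame(splits['train'])
print(list(right.columns))   # ['label', 'message']
```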



## Expanding Your Data Preview Skills

Advance your data exploration skills by completing the code to preview the first 20 entries of the SMS Spam Collection dataset. This task focuses on deepening your understanding of data display options in Python.

```python
import pandas as pd
from datasets import load_dataset

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')

# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# TODO: Display the first 20 entries of the DataFrame


```

To display the first 20 entries of the DataFrame, call the `head()` method with `20` as the argument:

```python
import pandas as pd
from datasets import load_dataset

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')

# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Display the first 20 entries of the DataFrame
print(df.head(20))
```

## Mastering the Basics of Dataset Handling

Building on the knowledge you've gained from loading and initially exploring the SMS Spam Collection dataset, it's time to bring everything together. In this task, you will start from scratch to implement the process of loading this dataset, converting it to a pandas DataFrame, and then displaying the first few entries. This exercise encapsulates the core skills of working with text data in Python that you have developed through this unit.

This practice is designed to ensure that you can confidently apply these fundamental data handling techniques on your own, preparing you for more advanced analysis and exploration in future units.


```python
# TODO: Import the pandas library

# TODO: Import the load_dataset function from the datasets library

# TODO: Load the SMS Spam Collection dataset using the load_dataset function

# TODO: Convert the loaded dataset into a pandas DataFrame

# TODO: Use the head() method on the DataFrame to preview the first few entries


```

A complete solution imports both libraries, loads the dataset, converts the `'train'` split to a DataFrame, and previews the first rows:

```python
import pandas as pd
from datasets import load_dataset

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')

# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Preview the first few entries of the DataFrame
print(df.head())
```