# Keras Basics with a Movie Review Classification Example

In [5]:
import keras
print(f"Keras version: {keras.__version__}")

Keras version: 3.11.3


### 1. Importing the dataset

In [6]:
try:
    from keras.datasets import imdb
    print("Successfully imported imdb")
except ImportError:
    print("Error importing imdb")
    raise  # re-raise: the load_data call below cannot run without imdb

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)


Successfully imported imdb


**The `imdb.load_data()` function** returns two tuples, `(train_data, train_labels)` and `(test_data, test_labels)`, so a single call unpacks into all four items at once:
- `train_data`
- `train_labels`
- `test_data`
- `test_labels`

**`num_words=10000`**
This is a **filter**.  
It tells the function: *"Only give me the numbers for the top 10,000 most frequent words."*

Any word outside that top 10,000 (like a **rare name** or a **typo**) is treated as an **unknown word** and replaced by a **special out-of-vocabulary index** (`2` by default, controlled by the `oov_char` argument).  
This keeps the **vocabulary size manageable**.
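
To make the filtering concrete, here is a minimal pure-Python sketch of the idea. `cap_vocabulary` and `toy_review` are hypothetical names used only for illustration, not part of Keras. The sketch assumes the default out-of-vocabulary index of `2` and ignores the other options `load_data` accepts.

```python
def cap_vocabulary(sequence, num_words, oov_char=2):
    """Replace every index >= num_words with the out-of-vocabulary index."""
    return [idx if idx < num_words else oov_char for idx in sequence]

# Hypothetical encoded review: larger indices stand for rarer words.
toy_review = [1, 14, 9999, 10000, 25831, 7]

capped = cap_vocabulary(toy_review, num_words=10000)
print(capped)  # [1, 14, 9999, 2, 2, 7]
```

After filtering, no index in a review is 10,000 or higher, which is what lets later steps assume a fixed vocabulary size.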

---

In [7]:
print("train_data:", type(train_data), train_data.shape)
print("train_labels:", type(train_labels), train_labels.shape)
print("test_data:", type(test_data), test_data.shape)
print("test_labels:", type(test_labels), test_labels.shape)

train_data: <class 'numpy.ndarray'> (25000,)
train_labels: <class 'numpy.ndarray'> (25000,)
test_data: <class 'numpy.ndarray'> (25000,)
test_labels: <class 'numpy.ndarray'> (25000,)


The variables `train_data` and `test_data` are one-dimensional NumPy arrays of reviews, each review being a Python list of word indices (encoding a sequence of words).  
`train_labels` and `test_labels` are NumPy arrays of **0s** and **1s**, where 0 stands for "negative" and 1 stands for "positive".
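
To see what "a list of word indices" means, the sketch below decodes a toy review back into text. `toy_word_index`, `INDEX_OFFSET`, and `decode_review` are hypothetical names, and the four-word vocabulary is made up for illustration. It assumes the default `load_data` settings, where indices 0, 1, and 2 are reserved (padding, start of sequence, unknown word) and real word ranks are shifted by 3.

```python
# Hypothetical miniature word index (word -> frequency rank), standing in
# for the real mapping, which has to be downloaded separately.
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}

# Assumption: default settings, so word ranks are shifted by 3 to make room
# for the reserved indices 0 (padding), 1 (start) and 2 (unknown).
INDEX_OFFSET = 3
reverse_index = {rank: word for word, rank in toy_word_index.items()}

def decode_review(sequence):
    """Turn word indices back into text; reserved indices become '?'."""
    return " ".join(reverse_index.get(idx - INDEX_OFFSET, "?") for idx in sequence)

encoded = [1, 4, 5, 6, 2, 7]
decoded = decode_review(encoded)
print(decoded)  # ? the movie was ? great
```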

---

In [8]:
print(type(train_data[1]), len(train_data[1]))
print(type(train_data[2]), len(train_data[2]))  
print(type(train_data[3]), len(train_data[3]))
print(type(train_data[24001]), len(train_data[24001]))  

<class 'list'> 189
<class 'list'> 141
<class 'list'> 550
<class 'list'> 158


The purpose of these lines is to **investigate the data**.  
Before we build a model, it's good practice to **look at the data closely** and understand its structure.

Let's break down one line:  
```
print(type(train_data[1]), len(train_data[1]))
```

- **`train_data[1]`**: This selects a single review from our dataset. Because Python indexing starts at 0, this is the second review.

- **`type(...)`**: This tells us the data type of that review.  
  The output is:
  ```
  <class 'list'>
  ```
  This confirms each review is a list of numbers.

- **`len(...)`**: This tells us the length of that list, which is the number of word indices stored for that specific review.

The output shows lengths like:
```
189, 141, 550, 158
```


So, the whole reason for running this code is to **prove two things**:
1. Each review is a simple list.  
2. More importantly, the reviews all have different lengths.

This exposes a problem:  
The data is **not uniform**, which is why we will need to **vectorize it**.

**Vectorizing** means converting the movie reviews, which are currently lists of word IDs with different lengths, into a fixed-size numeric format that a machine learning model can consume. Neural networks require every input in a batch to have the same shape, so we must standardize the data. There are two common ways to do this:

- **Multi-hot encoding** (often loosely called one-hot encoding): each review becomes a vector of 0s and 1s as long as the vocabulary, with a 1 wherever a word occurs in the review. Word order and repetition are discarded.
- **Padding sequences**: every review is truncated or padded with 0s to the same length, which keeps the word order.

In simple terms, vectorizing makes all reviews uniform, like turning random puzzle pieces into identical blocks so the model can process them consistently.
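
The two vectorizing approaches described above can be sketched with NumPy as follows. This is a minimal illustration rather than the Keras utilities: `multi_hot`, `pad_or_truncate`, and `toy_reviews` are hypothetical names, and this version pads and truncates at the end of each review, which is a simplifying assumption.

```python
import numpy as np

def multi_hot(sequences, dimension):
    """One row per review; column j is 1.0 if word index j occurs in it."""
    results = np.zeros((len(sequences), dimension), dtype="float32")
    for row, sequence in enumerate(sequences):
        results[row, sequence] = 1.0  # set all of the review's columns at once
    return results

def pad_or_truncate(sequences, maxlen, pad_value=0):
    """One row per review; keep the first maxlen indices, pad the rest."""
    results = np.full((len(sequences), maxlen), pad_value, dtype="int64")
    for row, sequence in enumerate(sequences):
        trimmed = sequence[:maxlen]
        results[row, :len(trimmed)] = trimmed
    return results

# Hypothetical reviews of different lengths, with a 10-word vocabulary.
toy_reviews = [[1, 4, 7], [1, 3], [1, 5, 2, 9, 4]]

encoded = multi_hot(toy_reviews, dimension=10)
padded = pad_or_truncate(toy_reviews, maxlen=4)

print(encoded.shape)  # (3, 10)
print(padded.shape)   # (3, 4)
```

Either way the result is a regular two-dimensional array: the first form has one column per vocabulary word, the second has one column per position in the review.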

---

In [None]:
print(type(train_data))

print(type(train_data[6]))

<class 'numpy.ndarray'>
<class 'list'>


`print(type(train_data))`

**What it does**: This checks the data type of the entire `train_data` variable.

**Why we do it**: The output is `<class 'numpy.ndarray'>`. This confirms that our 25,000 reviews are held in a NumPy array, the standard container for machine learning data. Because the reviews have different lengths, this particular array has `dtype=object`: it stores references to Python lists rather than a numeric grid, so NumPy's fast numeric operations do not apply to it yet. It's like checking the type of the main container.

`print(type(train_data[6]))`

**What it does**: This checks the data type of a single item inside the main container. `train_data[6]` gets the 7th review.

**Why we do it**: The output is `<class 'list'>`. This tells us that while the main container is a NumPy array, each individual review inside it is just a regular Python list.
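
The same container-of-lists structure can be reproduced on a small scale without downloading anything. In this sketch `toy_data` is a hypothetical stand-in for `train_data`, built from three made-up reviews of different lengths.

```python
import numpy as np

# Ragged lists cannot form a numeric grid, so we ask for an object array:
# a 1-D array whose elements are ordinary Python lists.
toy_data = np.array([[1, 14, 22], [1, 8], [1, 5, 9, 30]], dtype=object)

print(type(toy_data), toy_data.shape, toy_data.dtype)
print(type(toy_data[0]), [len(review) for review in toy_data])
```

The shape has a single dimension, just like `(25000,)` above, because NumPy only sees three list objects and knows nothing about their contents.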