# Load Data

We use pandas to load the caption data from a text file. The file is tab-separated, and it doesn't contain a header row. Therefore, we explicitly specify the column names as "Template", "val", and "Caption" when loading the data into a DataFrame.


In [10]:
# Load data
import pandas as pd
df = pd.read_csv('../memes900k/captions.txt', sep='\t', header=None, names=["Template", "val", "Caption"])
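The same call can be tried on a small in-memory sample before pointing it at the full file. The sketch below is only an illustration: the three rows are copied from captions that appear later in this notebook, and the names `sample` and `sample_df` are hypothetical.

```python
import io

import pandas as pd

# Three tab-separated rows in the same layout as captions.txt (no header row)
sample = (
    "Y U No\t0\tGoogle <sep> Y U NO LET ME FINISH TYPING?\n"
    "Y U No\t0\tKONY <sep> Y u no take justin bieber\n"
    "aysi\t0\tFOUND IT!!! <sep> <emp>\n"
)

# header=None stops pandas from treating the first row as column names
sample_df = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=None,
    names=["Template", "val", "Caption"],
)

print(sample_df.shape)
print(sample_df.dtypes)
```

Checking the shape and dtypes on a sample like this is a cheap way to confirm that the separator and column names match the file layout.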

# Check Data Distribution Among Templates

We examine the distribution of data across different meme templates to identify any inconsistencies that may affect the analysis. This is done by counting the occurrences of each template within the dataset.


In [11]:
df["Template"].value_counts()

Template
Y U No                                                                      3000
Buddy the Elf                                                               3000
Office Space Boss                                                           3000
Uncle Dolan                                                                 3000
Pepperidge Farm Remembers FG                                                3000
                                                                            ... 
skyrim stan                                                                 2964
Grumpy Cat                                                                  2960
Art Student Owl                                                             2959
Chronic Illness Cat                                                         2952
You keep using that word, I don't think it means what you think it means    2916
Name: count, Length: 300, dtype: int64
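To make the shortfall per template explicit, the counts can be compared against the largest group. This is a minimal sketch on five of the counts shown above; treating 3000 as the expected size is an assumption inferred from the largest groups, and the names `counts`, `expected`, and `missing` are hypothetical.

```python
import pandas as pd

# A few counts copied from the value_counts() output above
counts = pd.Series(
    {
        "Y U No": 3000,
        "Buddy the Elf": 3000,
        "skyrim stan": 2964,
        "Grumpy Cat": 2960,
        "Chronic Illness Cat": 2952,
    }
)

# Assumption: the largest group size is the intended number of captions per template
expected = counts.max()

# Keep only templates below the expected size and report how many rows they lack
short = counts[counts < expected]
missing = (expected - short).sort_values(ascending=False)
print(missing)
```

On the full DataFrame, `counts` would be `df["Template"].value_counts()`.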

# Encoding the Labels

For the classification model to process our data, we need to convert the categorical labels into numerical form. This is achieved by creating a dictionary that maps each unique template label to a unique index. This dictionary will be used to encode the labels in the dataset.


In [12]:
# Encoding the labels 
possible_labels = df.Template.unique()

label_dict = {} 
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index 
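The loop above can also be written as a dictionary comprehension. The sketch below uses a toy Series in place of `df.Template`; the reverse mapping `id_to_label` is a hypothetical helper that is not part of the original notebook, shown because decoding predictions back to template names usually needs it.

```python
import pandas as pd

# Toy stand-in for df.Template
templates = pd.Series(["Y U No", "Grumpy Cat", "Y U No", "Advice Dog"])

# unique() keeps order of first appearance, so indices are assigned in that order
label_dict = {label: index for index, label in enumerate(templates.unique())}

# Hypothetical reverse lookup for turning predicted indices back into names
id_to_label = {index: label for label, index in label_dict.items()}

# Encode the column with the mapping
labels = templates.map(label_dict)
print(label_dict)
print(labels.tolist())
```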


# Apply Label Encoding

We apply the previously created label dictionary to transform the 'Template' column into a new 'label' column. This column contains numerical values corresponding to each template, facilitating the use of these labels in machine learning models.


In [13]:
# Map each template name to its integer index; every template is a key in label_dict
df['label'] = df.Template.map(label_dict)

%store label_dict

Stored 'label_dict' (dict)




# Train and Validation Split

To prepare our data for model training and evaluation, we split it into training and validation sets using Scikit-learn's `train_test_split` function. We allocate 20% of the data to the validation set. The split is stratified by the label column to ensure that both sets have a representative distribution of each class.

After the split, we add a new column `data_type` to the DataFrame, initially setting all values to 'not_set'. We then update this column to 'train' or 'val' based on the indices from our split, indicating whether each entry belongs to the training set or the validation set.

Finally, we remove the 'val' column from the DataFrame, which is no longer needed, and display the count of entries for each combination of template, label, and data type, to verify our splits.


In [14]:
# Train and validation split 
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(df.index.values, 
                                                  df.label.values, 
                                                  test_size=0.20, 
                                                  random_state=42, 
                                                  stratify=df.label.values)

df['data_type'] = 'not_set'

df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'

df = df.drop('val', axis=1)
df.groupby(['Template', 'label', 'data_type']).count()

Template,label,data_type,Caption
1889 [10] guy,102,train,2400
1889 [10] guy,102,val,600
Advice Dog,76,train,2400
Advice Dog,76,val,600
Advice Hitler,110,train,2400
...,...,...,...
you have no power here,256,val,600
you mean to tell me black kid,196,train,2399
you mean to tell me black kid,196,val,600
your country needs you,167,train,2400
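The effect of `stratify` is easier to see on a toy dataset. The following sketch assumes three balanced classes of ten rows each (the `toy` frame and its names are hypothetical) and repeats the same tagging steps, so each class should end up with the same 80/20 division.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: three classes with ten rows each
toy = pd.DataFrame({"label": np.repeat([0, 1, 2], 10)})

# Split the row indices, stratifying on the label
train_idx, val_idx = train_test_split(
    toy.index.values,
    test_size=0.20,
    random_state=42,
    stratify=toy["label"].values,
)

# Tag every row with the split it belongs to
toy["data_type"] = "not_set"
toy.loc[train_idx, "data_type"] = "train"
toy.loc[val_idx, "data_type"] = "val"

# One row per label, one column per split
split_counts = toy.groupby(["label", "data_type"]).size().unstack()
print(split_counts)
```

Any row still tagged 'not_set' after this step would indicate an index that was missed by both sets.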


# Data Sampling

To optimize the training time and manage computational resources, we reduce the dataset size by sampling. Each 'Template' group in the DataFrame is halved by selecting 50% of its entries. This sampling is performed with a consistent random seed to ensure reproducibility.

The process involves:
- Grouping the DataFrame by 'Template' and applying a function to sample 50% of the data from each group.
- Resetting the index of the sampled DataFrame to maintain order and consistency.
- Displaying the count of entries for each combination of 'Template', 'label', and 'data_type' to verify the distribution in the sampled dataset.

Sampling half of each template trades some potential model accuracy for shorter training time and lower resource use.


In [15]:
# Group the DataFrame by 'Template' and sample 50% of the rows in each group,
# then reset the index of the sampled DataFrame
sampled_data = df.groupby('Template').apply(lambda x: x.sample(frac=0.5, random_state=42)).reset_index(drop=True)

# Count entries per template, label, and data type to verify the sampled distribution
sampled_data.groupby(['Template', 'label', 'data_type']).count()

%store sampled_data




Stored 'sampled_data' (DataFrame)
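Per-group sampling can also be expressed without `apply`, assuming pandas 1.1 or later, where `DataFrameGroupBy.sample` is available. The sketch below runs on a hypothetical toy frame; it is not guaranteed to pick the same rows as the `apply`-based version, because the random state is consumed differently.

```python
import pandas as pd

# Toy frame: template "A" has 4 captions, template "B" has 6
toy = pd.DataFrame(
    {
        "Template": ["A"] * 4 + ["B"] * 6,
        "Caption": [f"caption {i}" for i in range(10)],
    }
)

# Sample half of each group, then renumber the rows from zero
sampled = (
    toy.groupby("Template")
    .sample(frac=0.5, random_state=42)
    .reset_index(drop=True)
)

print(sampled["Template"].value_counts())
```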


In [16]:
%store df

Stored 'df' (DataFrame)


# Data Preparation: Splitting Data into Train and Validation Sets

To organize our data for training and validation, we create lists `train_data` and `val_data` to store captions and labels based on their respective data types ('train' or 'val').


In [17]:
# Create lists to store captions and labels
train_data = []
val_data = []

# Iterate over the DataFrame
for index, row in df.iterrows():
    template = row['Template']
    caption = row['Caption']
    data_type = row['data_type']
    
    # Append the template and caption data to the appropriate list based on data_type
    if data_type == 'train':
        train_data.append((template, caption))
    elif data_type == 'val':
        val_data.append((template, caption))

# Print the length of the train and val data
print("Train data length:", len(train_data), train_data[:5])
print("Val data length:", len(val_data), val_data[:5]) 

Train data length: 719473 [('Y U No', 'commercial <sep> y u no same volume as show!?'), ('Y U No', 'Victoria <sep> y u no tell us your secret?!'), ('Y U No', 'KONY <sep> Y u no take justin bieber'), ('Y U No', 'TED <sep> y u no tell us how you met their mother'), ('Y U No', 'universal remote <sep> y u no work on universe?')]
Val data length: 179869 [('Y U No', 'Google <sep> Y U NO LET ME FINISH TYPING?'), ('Y U No', 'i held the door <sep> y u no say thank you'), ('Y U No', 'Team rocket <sep> y u no catch a different pikachu?'), ('Y U No', 'Y u no guy <sep> y u sound asian in my head?'), ('Y U No', 'hands <sep> y u no have same amount of fingers?')]
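Row-by-row iteration with `iterrows` is slow on a DataFrame of this size. As a sketch of an alternative, the same lists of `(template, caption)` tuples can be built with a boolean mask and `itertuples`. The helper `pairs_for` and the three-row `toy` frame are hypothetical; the captions are taken from the output above.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Template": ["Y U No", "Y U No", "Y U No"],
        "Caption": [
            "commercial <sep> y u no same volume as show!?",
            "Google <sep> Y U NO LET ME FINISH TYPING?",
            "KONY <sep> Y u no take justin bieber",
        ],
        "data_type": ["train", "val", "train"],
    }
)


def pairs_for(frame, split):
    """Return (template, caption) tuples for rows whose data_type equals split."""
    subset = frame.loc[frame["data_type"] == split, ["Template", "Caption"]]
    # name=None yields plain tuples instead of namedtuples
    return list(subset.itertuples(index=False, name=None))


train_pairs = pairs_for(toy, "train")
val_pairs = pairs_for(toy, "val")
print(len(train_pairs), len(val_pairs))
```

Row order within each split is preserved, so the resulting lists match what the loop would produce on the same frame.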


In [18]:
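# RAKE (Rapid Automatic Keyword Extraction) ranks candidate key phrases in a text.
# rake_nltk relies on NLTK data (such as the stopwords corpus) being available.
from rake_nltk import Rake
rake = Rake()

def get_keywords(text):
    """Return the key phrases RAKE extracts from text, highest ranked first."""
    rake.extract_keywords_from_text(text)
    return rake.get_ranked_phrases()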

# Data Inspection

In this section, we iterate over the dataset to find anomalies, such as rows that were misparsed because of incorrect separators. Such anomalies degrade the quality of the models we train, so the dataset needs to be cleaned before training.
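One way several rows can collapse into a single caption is an unbalanced double quote: with default settings, `read_csv` treats a field that starts with `"` as quoted and keeps reading across line breaks until the closing quote. Whether this is what happens in this dataset is an assumption, not something the notebook confirms. The sketch below reproduces the effect on a hypothetical three-row sample, with the quote characters added for illustration, and shows that `quoting=csv.QUOTE_NONE` keeps the rows separate.

```python
import csv
import io

import pandas as pd

# Hypothetical sample: a stray opening quote on row 1 and a closing quote on row 3
raw = (
    'aysi\t0\t"FOUND IT!!! <sep> <emp>\n'
    "aysi\t0\tSTAND BACK <sep> YOUR BLACK\n"
    'aysi\t0\tOKAY, OKAY I AM <sep> INASINT"\n'
)
columns = ["Template", "val", "Caption"]

# Default quoting: the quoted field swallows the line breaks, giving one merged row
default_df = pd.read_csv(io.StringIO(raw), sep="\t", header=None, names=columns)

# QUOTE_NONE: quote characters are ordinary text, so every line stays its own row
raw_df = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    header=None,
    names=columns,
    quoting=csv.QUOTE_NONE,
)

print(len(default_df), len(raw_df))
```

With `QUOTE_NONE` the quote characters remain in the caption text, so they would need to be handled in a later cleaning step.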

In [19]:
# Data inspection 
train_captions = []
train_classes = []
train_keywords = []

val_captions = []
val_classes = []
val_keywords = []

# Iterate over the train_data list
for template, caption in train_data:
    train_captions.append(caption)
    train_classes.append(template)
    kw = get_keywords(caption)
    train_keywords.append(kw)

# Iterate over the val_data list
for template, caption in val_data:
    val_captions.append(caption)
    val_classes.append(template)
    kw = get_keywords(caption)
    val_keywords.append(kw)

# Print the length of the train and val data
print("Train data length:", len(train_captions), len(train_classes), len(train_keywords))
print("Val data length:", len(val_captions), len(val_classes), len(val_keywords))


all_captions = train_captions + val_captions    
largest_word_count = 0
longest_strings = []
smallest_word_count = float('inf')
shortest_string = ""

for string in all_captions:
    word_count = len(string.split())
    if word_count > largest_word_count:
        largest_word_count = word_count
        longest_strings = [string]
    elif word_count == largest_word_count:
        longest_strings.append(string)

# Sort the longest strings by length
longest_strings.sort(key=len, reverse=True)

# Print the top 10 longest strings
print(f"The top 10 strings with the largest number of words have {largest_word_count} words each:")
for i, string in enumerate(longest_strings[:10]):
    print(f"{i+1}. {string}")

# print(f"\nThe string with the smallest number of words has {smallest_word_count} words:")
# print(shortest_string)

Train data length: 719473 719473 719473
Val data length: 179869 179869 179869
The top 10 strings with the largest number of words have 257 words each:
1. COMPREI CAPINHAS NOVAS PRO MEU IPHONE 5 <sep> UI, ELA TEM IPHONE 5
aysi	0	I'M A 10 PLAYA <sep> A 10
aysi	0	10... <sep> THAT IS ALL
aysi	0	i ain't one to gossip <sep> so you didn't hear it from me
aysi	0	sly remarks at mods <sep> watch out we got a badass over here
aysi	0	I HON DEINA KORTN ET GSEGN! <sep> <emp>
aysi	0	HAY SI HAY SI <sep> SOY EL GERENTE PLANTHOM
aysi	0	FOUND IT!!! <sep> <emp>
aysi	0	OKAY, OKAY I AM <sep> INASINT
aysi	0	STAND BACK <sep> YOUR BLACK
aysi	0	12 YEAR OLDS SMOKING WEED <sep> WE GOT SOME BADASS IS OVER HERE
aysi	0	AY SI! AY SI! LE VOY A LOS LONGHORNS! <sep> Y NI ACABE LA HIGH SCHOOL!
aysi	0	WOOHHHHH <sep> BADASS BLACK BOY OVER HERE
aysi	0	IT AINT MY FAULT <sep> YOU DONT READ YOUR BIBLE I DIDNT SAY FOLLOW ME YOU DID THAT
aysi	0	AY SI AY SI ESTOY FELIZ... <sep> PORQUE TENGO BICICLETA NUEVA
aysi	0	KI E STATO A LAS