# Long Input
At some point during training I realized that we were bloating the network because a few clues contain a large number of words. We were padding every input to length 40, and every one of those positions was going through the embedding table. Let's take a look at how many long clues we actually have.

In [1]:
import pandas as pd
df = pd.read_csv("cleaned_data/clean_2.csv", keep_default_na=False)
df

Unnamed: 0,answer,clue
0,pat,"action done while saying ""good dog"""
1,rascals,mischief-makers
2,pen,it might click for a writer
3,sep,fall mo.
4,eco,kind to mother nature
...,...,...
770356,nat,actor pendleton
770357,shred,bit
770358,nea,teachers' org.
770359,beg,petition


In [2]:
from torchtext.data.utils import get_tokenizer
tokenizer = get_tokenizer('basic_english')

df['clue_tokens'] = df['clue'].apply(lambda x: len(tokenizer(x)))
df.sort_values(by='clue_tokens', ascending=False)

Unnamed: 0,answer,clue,clue_tokens
321600,abe,nickname of the man (born 2/12/1809) who gave ...,42
52900,oct,"mo. when the n.f.l., n.b.a., n.h.l. and m.l.b....",34
4578,seneca,"roman philosopher who said ""life is never inco...",34
100497,oprah,"who said ""i'm black. i don't feel burdened by ...",33
160005,sotomayor,"supreme court justice who once said ""i am a ne...",31
...,...,...,...
21685,ditto,"""",0
373061,inches,"""",0
155787,ditto,"""",0
6067,dittomarks,""" "" """,0


Interesting that some clues have zero tokens. Let's look at those rows.

In [3]:
df[df['clue_tokens'] == 0]

Unnamed: 0,answer,clue,clue_tokens
6067,dittomarks,""" "" """,0
21685,ditto,"""",0
130662,inches,"""",0
155787,ditto,"""",0
275873,inches,"""",0
282584,quotes,""" """,0
373061,inches,"""",0
436433,dittos,""" "" "" "" """,0
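
Every zero-token clue consists solely of double-quote characters. That is consistent with a tokenizer that strips double quotes before splitting on whitespace, which I believe is what `basic_english` does. Here is a rough stand-in for that behavior. The `rough_tokenize` helper is hypothetical and only approximates the real tokenizer, so treat it as a sketch, not torchtext's implementation.

```python
import re

def rough_tokenize(text):
    """Hypothetical approximation of a basic English tokenizer.

    Assumption: double quotes are removed outright and common punctuation
    is split off into separate tokens. This is not torchtext's code.
    """
    text = text.lower()
    text = text.replace('"', '')                  # drop double quotes entirely
    text = re.sub(r"([.,!?()'])", r" \1 ", text)  # pad punctuation with spaces
    return text.split()                           # split on runs of whitespace

# Quote-only clues collapse to an empty token list; ordinary clues do not.
for clue in ['"', '" " "', 'fall mo.', 'mischief-makers']:
    print(repr(clue), len(rough_tokenize(clue)))
```

Under that assumption, a clue made only of ditto marks has nothing left to embed, so it makes sense to filter these rows out.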


In [4]:
# Count the clues with more than 10 tokens
len(df) - len(df[df['clue_tokens'] <= 10])

11479

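So, if we remove all clues longer than 10 tokens, we can shrink our network while losing only 11,479 of the 770,361 clues (about 1.5%). We also drop the zero-token clues, since they give the network nothing to work with.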
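
A quick check of the trade-off, using the row counts printed in this notebook. The pad lengths are an assumption: the old value of 40 comes from the text above, and the new value of 10 assumes we pad exactly to the new cutoff.

```python
total_clues = 770_361   # rows in the full dataset (notebook output)
long_clues = 11_479     # clues with more than 10 tokens (notebook output)

# Share of the dataset lost by capping clue length at 10 tokens
fraction_dropped = long_clues / total_clues
print(f"dropped: {fraction_dropped:.2%}")

# Assumed pad lengths: 40 before, 10 after (pad to the cutoff)
old_pad_len, new_pad_len = 40, 10
reduction = old_pad_len / new_pad_len
print(f"padded positions per clue: {old_pad_len} -> {new_pad_len} "
      f"({reduction:.0f}x fewer)")
```

Note that this only measures the input length. How much the network itself shrinks depends on how the layers after the embedding use the sequence dimension.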

In [5]:
print(len(df))                      # all clues
df = df[df['clue_tokens'] > 0]      # drop clues that tokenize to nothing
print(len(df))
df = df[df['clue_tokens'] <= 10]    # drop clues longer than 10 tokens
print(len(df))

770361
770353
758874
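
The cutoff of 10 was checked with a single count. To compare several cutoffs at once, one option is to compute the share of clues kept at each length. The sketch below uses a small made-up list of token counts in place of `df['clue_tokens']`; both `coverage_by_cutoff` and `toy_counts` are hypothetical names used only for illustration.

```python
from collections import Counter
from itertools import accumulate

def coverage_by_cutoff(token_counts):
    """Return {cutoff: fraction of items whose length is <= cutoff}."""
    counts = Counter(token_counts)
    lengths = sorted(counts)
    # Running total of items at or below each observed length
    running = accumulate(counts[n] for n in lengths)
    total = len(token_counts)
    return {n: kept / total for n, kept in zip(lengths, running)}

# Made-up token counts, standing in for the real column
toy_counts = [1, 1, 2, 2, 2, 3, 4, 4, 12, 42]
for cutoff, kept in coverage_by_cutoff(toy_counts).items():
    print(f"<= {cutoff:2d} tokens keeps {kept:.0%} of clues")
```

On the real data, the same table would show how quickly coverage flattens out and whether a cutoff slightly above or below 10 changes the picture.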


In [6]:
# The token count was only needed for filtering, so drop it before saving
df = df.drop(columns=['clue_tokens'])
df.to_csv('cleaned_data/dupes_10_or_less_tokens.csv', index=False)