<a href="https://colab.research.google.com/github/worldterminator/mess/blob/main/trump.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Data import

In [3]:
from google.colab import drive
drive.mount('/content/drive')
import pandas as pd
file_path = '/content/drive/My Drive/trump.csv'
trump = pd.read_csv(file_path)

print(trump.head())

Mounted at /content/drive
   Durationinseconds                                      describephoto  \
0                 30                                                NaN   
1                142                                            rioting   
2                243   Terrorist acts against a duly elected government   
3                162        People are trying to overturn our democracy   
4                282  White supremacists under the guise of Qanon ar...   

                                          whynervous feedback     IP_country  \
0                                                NaN      NaN  United States   
1         Because it shows disregard for rule of law      NaN  United States   
2  Our president is encouraging a coup and has un...      NaN  United States   
3  I think they're trying to disregard the basis ...      NaN  United States   
4  It is worrying that the president has encourag...      NaN  United States   

   ID  
0  66  
1  67  
2  68  
3  69  
4 

#Flag suspicious responses (decribephoto, whynervous)

I kept codes used to flag by the number of characters, but it's better to first tokenize and flag by minimum words. I want to both check the length of responses and whether some of them, especially the shorter ones, fail to include relevant information. A contexualized classifier could be trained with dictionaries perhaps?

##by minimum characters


In [32]:
trump.dropna(subset=['describephoto'], inplace=True)  # drop rows without a response
# Basic text cleaning (lowercasing, removing punctuation)
trump['describephoto_clean'] = trump['describephoto'].str.lower().str.replace('[^\w\s]', '', regex=True)
#just doing this for the whynervous variable here
min_words = 2
trump['flag_shortdescription'] = trump['describephoto_clean'].apply(lambda x: len(x.split()) < min_words)
print(trump)

      Durationinseconds                                      describephoto  \
1                   142                                            rioting   
2                   243   terrorist acts against a duly elected government   
3                   162        people are trying to overturn our democracy   
4                   282  white supremacists under the guise of qanon ar...   
5                   243  the people in the photo are attempting to brea...   
...                 ...                                                ...   
1266               3196  they are being sprayed by police while seeming...   
1267                577                             protesting trumps loss   
1268                331  taking their country back and taking action to...   
1269               1366        reacting negatively to the election outcome   
1277                323                                   an act of terror   

                                             whynervous      fe

In [42]:
summary_stats = trump['flag_shortdescription'].describe()
counts = trump['flag_shortdescription'].value_counts()
print(summary_stats)
print(counts)

count      1178
unique        2
top       False
freq       1096
Name: flag_shortdescription, dtype: object
False    1096
True       82
Name: flag_shortdescription, dtype: int64


In [36]:
# to get the IDs, filter rows where 'flag_invalid' is True and print their corresponding IDs
shortdescription = trump[trump['flag_shortdescription'] == True]
shortdescription_ids = shortdescription['ID'].unique()
print(shortdescription_ids)
# these are the IDs that had one-word decription. from here, manually inspect them or continue classification

[  67   75  104  120  140  144  162  175  189  196  197  202  203  214
  215  218  236  340  358  363  364  371  386  389  390  401  411  445
  453  476  480  481  504  509  564  579  582  583  627  632  648  674
  685  715  719  720  729  773  817  821  831  849  851  931  959  964
  968  973  997 1002 1014 1015 1028 1047 1056 1073 1077 1092 1129 1137
 1148 1175 1177 1179 1188 1200 1211 1224 1242 1247 1272 1316]


In [38]:
specific_row = trump [trump['ID'] == 75]
print(specific_row)

   Durationinseconds describephoto  \
9                163    protesting   

                                          whynervous feedback     IP_country  \
9  No one cared about the riots that have gone on...      NaN  United States   

   ID                                   whynervous_clean  flag_invalid  \
9  75  no one cared about the riots that have gone on...         False   

                   whynervous_tokens  token_count  \
9  [one, cared, riots, gone, months]           12   

             whynervous_tokens_clean  token_count_clean describephoto_clean  \
9  [one, cared, riots, gone, months]                  5          protesting   

   flag_shortdescription  
9                   True  


In [39]:
trump.dropna(subset=['whynervous'], inplace=True)
# Basic text cleaning (lowercasing, removing punctuation)
trump['whynervous_clean'] = trump['whynervous'].str.lower().str.replace('[^\w\s]', '', regex=True)
#just doing this for the whynervous variable here
min_words = 2
trump['flag_shortwhynervous'] = trump['whynervous_clean'].apply(lambda x: len(x.split()) < min_words)
print(trump)

      Durationinseconds                                      describephoto  \
1                   142                                            rioting   
2                   243   terrorist acts against a duly elected government   
3                   162        people are trying to overturn our democracy   
4                   282  white supremacists under the guise of qanon ar...   
5                   243  the people in the photo are attempting to brea...   
...                 ...                                                ...   
1266               3196  they are being sprayed by police while seeming...   
1267                577                             protesting trumps loss   
1268                331  taking their country back and taking action to...   
1269               1366        reacting negatively to the election outcome   
1277                323                                   an act of terror   

                                             whynervous      fe

In [41]:
summary_stats = trump['flag_shortwhynervous'].describe()
counts = trump['flag_shortwhynervous'].value_counts()
print(summary_stats)
print(counts)

count      1178
unique        2
top       False
freq       1165
Name: flag_shortwhynervous, dtype: object
False    1165
True       13
Name: flag_shortwhynervous, dtype: int64


In [43]:
shortwhynervous = trump[trump['flag_invalid'] == True]
shortwhynervous_ids = shortwhynervous['ID'].unique()
print(shortwhynervous_ids)

[ 171  189  401  476  582  719  720  751  991 1060 1073 1148 1156]


##by minimum tokens

In [47]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

# download necessary NLTK data (*)
nltk.download('punkt')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# tokenize and clean text (remove punctuation and stop words)
def clean_tokenize(text):
    no_punctuation = text.translate(str.maketrans('', '', string.punctuation))
    # tokenizing
    tokens = word_tokenize(no_punctuation)
    # removing stop words and lowercasing
    return [word.lower() for word in tokens if word.lower() not in stop_words]

# apply the functions
trump['describephoto_tokens_clean'] = trump['describephoto_clean'].apply(clean_tokenize)
# now count the tokens again
trump['description_token_count_clean'] = trump['describephoto_tokens_clean'].apply(len)
# filter rows based on the new token count
data_filtered = trump[trump['description_token_count_clean'] >= 2]

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [48]:
data_filtered.shape

(1067, 17)

In [49]:
#print out the flagged IDs
shortdescription_token = trump[trump['token_count_clean'] < 2]
shortdescription_token_ids = shortdescription_token['ID'].unique()

print(shortdescription_token_ids)

[ 104  171  189  190  381  401  476  582  719  720  725  751  779  790
  991 1060 1073 1095 1108 1148 1156 1224]


In [50]:
specific_row = trump [trump['ID'] == 719]
print(specific_row)

     Durationinseconds describephoto whynervous feedback     IP_country   ID  \
653                147           war        war     good  United States  719   

    whynervous_clean  flag_invalid whynervous_tokens  token_count  \
653              war          True             [war]            1   

    whynervous_tokens_clean  token_count_clean describephoto_clean  \
653                   [war]                  1                 war   

     flag_shortdescription  flag_shortwhynervous describephoto_tokens_clean  \
653                   True                  True                      [war]   

     description_token_count_clean  
653                              1  


In [51]:
# apply functions to whynervous
trump['whynervous_tokens_clean'] = trump['whynervous_clean'].apply(clean_tokenize)
# now count the tokens again
trump['token_count_clean'] = trump['whynervous_tokens_clean'].apply(len)
# filter rows based on the new token count
data_filtered = trump[trump['token_count_clean'] >= 2]

In [52]:
data_filtered.shape

(1156, 17)

In [53]:
#print out the flagged IDs
shortwhynervous_token = trump[trump['token_count_clean'] < 2]
shortwhynervous_token_ids = shortwhynervous_token['ID'].unique()

print(shortwhynervous_token_ids)

[ 104  171  189  190  381  401  476  582  719  720  725  751  779  790
  991 1060 1073 1095 1108 1148 1156 1224]


## by irrelevant semantic (train a model on specifics)
