# My Approach: Heuristic Intent Distribution Exploration
I want insight into the true intents in my Twitter data. Keyword search is a reasonable baseline for this, and I build on it with heuristic clustering: I group Tweets into intents by keyword and try to minimize the overlap between the groups. The goal is to distill a set of intents that are distinct and mutually exclusive, so that Eve bot can tell them apart more easily.
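
One way to put a number on "overlap between intents" is to compare the sets of Tweets that each intent's keywords match. The following is only a sketch under assumptions: the Tweets and keyword lists are made-up toy data, `matching_indices` and `jaccard` are hypothetical helper names, and Jaccard similarity is my choice of overlap measure rather than something this notebook already uses.

```python
from itertools import combinations

# Toy, made-up processed Tweets (not taken from the real dataset)
tweets = [
    "my battery drains after the update",
    "cannot pay with my credit card",
    "battery dies fast",
    "update broke my payment app",
]

# Hypothetical keyword lists, one per intent
intents = {
    "update": ["update"],
    "battery": ["battery"],
    "payment": ["credit", "card", "payment", "pay"],
}

def matching_indices(tweets, keywords):
    """Indices of Tweets that contain at least one keyword as a whole token."""
    return {i for i, t in enumerate(tweets) if set(t.split()) & set(keywords)}

def jaccard(a, b):
    """Jaccard overlap of two index sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Tweets matched by each intent, then the overlap of every pair of intents
matches = {name: matching_indices(tweets, kws) for name, kws in intents.items()}
overlaps = {
    (x, y): jaccard(matches[x], matches[y])
    for x, y in combinations(sorted(matches), 2)
}
for pair, score in overlaps.items():
    print(pair, round(score, 2))
```

Pairs with a high score share many Tweets and are candidates for merging or for tighter keyword lists; pairs near zero are already well separated.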

I drew inspiration from observing other solutions, particularly the "semantic fingerprint" concept implemented by cortex, although the specific workings were undisclosed. This prompted me to devise my approach. Initially, I considered branching off into manual selection of 1000 examples, but I realized this was impractical and overly laborious.

This notebook documents a strategy for generating training data for intent classification.

In [None]:
import pandas as pd
print(f'pandas: {pd.__version__}')
import numpy as np
print(f'numpy: {np.__version__}')
# Visualization 
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="ticks", color_codes=True)
# Making my visualizations pretty
sns.set_style('whitegrid')
# Combination exploration
import itertools
import yaml

# Loading back processed data
processed = pd.read_pickle('objects/processed.pkl') # Load data from 1.0_ipynb file 
print(f'\ninbound:\n{processed.head()}')

## Brief Keyword Search EDA
A quick single-keyword filter for browsing the processed Tweets.

In [None]:
# Search by keywords (single keyword filter)
keyword = 'info'

# Seeing what the processed Tweets look like
filt = [(i, comment_txt) for i, comment_txt in enumerate(processed["Processed Inbound"]) if keyword in comment_txt]
filtered = processed.iloc[[i[0] for i in filt]]
print(f'{len(filtered)} Tweets contain the keyword {keyword}')
filtered
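
The filter above relies on Python's `in` operator. If the processed Tweets are plain strings (an assumption; on token lists `in` already tests whole tokens), `in` is a substring test, so `'info'` also matches `'information'`. A minimal sketch of whole-word matching with the standard `re` module follows; `contains_word` is a hypothetical helper and the sample strings are made up.

```python
import re

def contains_word(text, keyword):
    """True if `keyword` occurs in `text` as a whole word, not inside a longer word."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, text) is not None

# Made-up sample Tweets for illustration only
samples = ["need more info on my order", "the information was wrong"]

substring_hits = [s for s in samples if "info" in s]
whole_word_hits = [s for s in samples if contains_word(s, "info")]
print(len(substring_hits), len(whole_word_hits))
```

Whether substring or whole-word matching is preferable depends on how aggressively the text was stemmed or lemmatized upstream.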

In [None]:
# Indirect (surface-level) text features computed from the raw comment text
import re
import string

def calculate_indirect_features(df, eng_stopwords):
    """
    Calculate indirect features for a DataFrame based on the 'comment_text' column.

    Parameters:
    df (pandas.DataFrame): DataFrame containing the data.
    eng_stopwords (set): Set of English stopwords.

    Returns:
    df (pandas.DataFrame): DataFrame with the new features added.
    """

    # Sentence count in each comment (approximated as number of newlines + 1)
    df['count_sent'] = df["comment_text"].apply(lambda x: len(re.findall("\n", str(x))) + 1)

    # Word count in each comment
    df['count_word'] = df["comment_text"].apply(lambda x: len(str(x).split()))

    # Unique word count
    df['count_unique_word'] = df["comment_text"].apply(lambda x: len(set(str(x).split())))

    # Letter count
    df['count_letters'] = df["comment_text"].apply(lambda x: len(str(x)))

    # Punctuation count
    df["count_punctuations"] = df["comment_text"].apply(lambda x: len([c for c in str(x) if c in string.punctuation]))

    # Upper case words count
    df["count_words_upper"] = df["comment_text"].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))

    # Title case words count
    df["count_words_title"] = df["comment_text"].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))

    # Number of stopwords
    df["count_stopwords"] = df["comment_text"].apply(lambda x: len([w for w in str(x).lower().split() if w in eng_stopwords]))

    # Average length of the words
    df["mean_word_len"] = df["comment_text"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))

    return df

In [None]:
# Post-hoc intents: find the keywords associated with a particular intent and search on those keywords

# Version 1 (initial iteration): dictionary mapping intents to predefined responses

intents = {"Greeting": ["Hi there!", "Hello"],
           "Closing": "Thanks for talking",
           "Promotion": "",
           "Complaint": "",
           "Scenarios": {"Last Payment": "", "Account Details": "",
                         "Account Confirmation": ""},
           "Location": ""}

# Version 2: for showing progress
# Note: the second 'watch' key is renamed to 'tv' here; a duplicate key would silently overwrite the first
intents = {'greeting': ['hi', 'hello', 'hey', 'yo'], 'app': ['app', 'application'],
           'iphone': ['iphone', 'i phone'], 'icloud': ['icloud', 'i cloud'],
           'ios': ['io'], 'battery': ['battery'], 'watch': ['watch'],
           'mac': ['mac', 'macbook', 'laptop', 'computer'], 'update': ['update'],
           'troubleshooting': ['problem', 'trouble', 'error'],
           'settings': ['settings', 'setting'], 'music': ['music', 'song', 'playlist'],
           'payment': ['credit', 'card', 'payment', 'pay'], 'bug': ['bug'], 'tv': ['tv', 'show'],
           'network': ['internet', 'connection', 'network']}
          
# Intents that match only when every keyword in the list appears in the Tweet (alternative filtering method)
intents_all = {'ios update': ['io', 'update'], 'app update': ['app','update']}

# Version 3
intents = {'update': ['update'], 'battery': ['battery', 'power'], 'forgot_password':['password','account','login'],
          'repair':['repair','fix','broken'],  
           'payment': ['credit','card','payment','pay']}

# Storing it to YAML file
with open('objects/intents.yml', 'w') as outfile:
    yaml.dump(intents, outfile, default_flow_style=False)

print('INTENTS FOR KEYWORD EDA BELOW:\n ------------------------')
for intent, keywords in intents.items():
    print('Intent: {} \n   Keywords: {}'.format(intent, " + ".join(keywords)))

## Useful functions for Intent Classification

In [None]:
def get_key_tweets(series, keywords):
    '''
    Takes a list of keywords and returns the Tweets that contain at least
    one of these keywords (each matching Tweet is returned once)
    '''
    keyword_tweets = []
    for tweet in series:
        # any() stops at the first hit, so a Tweet matching several keywords is appended only once
        if any(word in tweet for word in keywords):
            keyword_tweets.append(tweet)
    return keyword_tweets

def get_key_tweets_all(series, keywords):
    '''
    Takes as input the list of keywords and outputs the Tweets that contains all
    of these keywords
    '''
    keyword_tweets = []
    for tweet in series:
        if all(word in tweet for word in keywords):
            keyword_tweets.append(tweet)
    return keyword_tweets
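
To turn keyword matching into an intent distribution, one option is to count how many Tweets each intent's keywords hit. The sketch below is self-contained and uses made-up Tweets; `count_intents` is a hypothetical helper, and it uses the same "at least one keyword" `in` test as the functions above.

```python
from collections import Counter

def count_intents(tweets, intents):
    """Count, per intent, how many Tweets contain at least one of its keywords."""
    counts = Counter()
    for tweet in tweets:
        for name, keywords in intents.items():
            if any(word in tweet for word in keywords):
                counts[name] += 1
    return counts

# Made-up Tweets and a subset of keyword lists, for illustration only
toy_tweets = [
    "phone battery dies after update",
    "screen broken need repair",
    "update stuck",
    "hello there",
]
toy_intents = {
    "update": ["update"],
    "battery": ["battery", "power"],
    "repair": ["repair", "fix", "broken"],
}

counts = count_intents(toy_tweets, toy_intents)
for name, n in counts.most_common():
    print(f"{name}: {n} of {len(toy_tweets)} Tweets")
```

A Tweet can count toward several intents here, so the counts need not sum to the number of Tweets; that surplus is itself a rough signal of how much the intents overlap.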