# Data Cleanup: Removing Corporate Customer Service Twitter Accounts
The goal of this notebook is to show how I got rid of:
- all usernames belonging to corporate/business accounts that are involved in **customer service**;
- Twitter accounts that focus on very specific topics (e.g. porn).

None of these are likely to be useful for the core analysis I'm interested in.

In [5]:
import operator

In [1]:
# Let's read back the node degrees that were dumped from the Notebook 5.NetworkMetrics.ipynb
node_degrees = []
with open("../data/temp/node_degrees.csv") as f:
    for line in f:  # iterate over the file lazily instead of loading every line first
        pair = tuple(int(el) for el in line.strip('\n').split(','))
        node_degrees.append(pair)

In [6]:
# Sort in descending order by degree
node_degrees = sorted(node_degrees, key=operator.itemgetter(1), reverse=True)

The end result will be a set of node IDs to remove from the MMR network, along with all their edges and any nodes left with degree 0 by this removal. As explained later, I'll first collect all *potential* unwanted usernames in a separate set, then check which of them actually meet some custom requirements and can therefore be classified as *really unwanted*.
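To make the removal step concrete, here is a minimal sketch of what it does to a graph. It assumes the network is undirected and stored as a plain dictionary mapping each node ID to the set of its neighbours; the function name `remove_nodes` and the toy graph are hypothetical and not part of the original notebooks.

```python
def remove_nodes(adjacency, unwanted):
    """Drop unwanted nodes, their edges, and nodes left isolated by the removal.

    `adjacency` maps each node ID to the set of its neighbours (undirected graph).
    """
    unwanted = set(unwanted)
    cleaned = {}
    for node, neighbours in adjacency.items():
        if node in unwanted:
            continue  # the node itself is dirty
        remaining = neighbours - unwanted  # drop edges towards dirty nodes
        if neighbours and not remaining:
            continue  # node became 0-degree because of the removal
        cleaned[node] = remaining
    return cleaned


# Toy graph: node 1 plays the role of a high-degree corporate account
toy_graph = {1: {2, 3}, 2: {1}, 3: {1, 4}, 4: {3}}
print(remove_nodes(toy_graph, {1}))
```

In the toy graph, node 2 disappears together with node 1 because its only edge pointed to the removed account, while nodes 3 and 4 keep their mutual edge.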

In [9]:
dirty_data = set()
potential_dirty_data = set()

## 1. Extract usernames from List of Twitter Corporate Accounts
The Internet provided me with a convenient, yet incomplete and not properly structured, [list of Twitter Corporate Accounts](https://gist.github.com/mbejda/45db05ea50e79bc42016#file-fortune-1000-company-twitter-accounts-csv) that could help me quickly figure out which Twitter usernames belong to a company rather than a private user. I cleaned up the original CSV file, came up with a list of 428 Twitter corporate usernames and wrote a bash script, `find_companies_usernames_from_list.sh`, to check which corporate accounts are included in our data:
~~~bash
while IFS= read -r h
do
    egrep "^$h," -i -m 1 "$2" | tee -a ../data/found_companies.csv
done < "$1"
~~~
I then ran the script with:
~~~bash
./find_companies_usernames_from_list.sh /path/to/twitter_companies.txt /path/to/usernames.csv
~~~
and got the `found_companies.csv` output file with **346 unique entries**. All these may be added to the `potential_dirty_data` set:

In [11]:
%%bash --out shell_output
# Get only the IDs to work on in Python
awk -F "," '{print $2}' ../data/found_companies.csv

In [12]:
companies_ids = [int(el) for el in shell_output.split('\n') if el != '']
potential_dirty_data.update(companies_ids)
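For readers who prefer to stay in Python, the lookup done by the bash script in this section can be approximated as follows. This is only a sketch: it assumes that `usernames.csv` holds `username,id` rows (which is what the `^$h,` pattern and the `awk` call on the second column suggest), and the function name and toy data are hypothetical.

```python
import csv
import io


def find_company_rows(handles, usernames_csv):
    """Return the first (username, node_id) row matching each handle, ignoring case."""
    first_row_by_name = {}
    for row in csv.reader(usernames_csv):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        # setdefault keeps only the first occurrence, like `grep -m 1`
        first_row_by_name.setdefault(row[0].lower(), (row[0], int(row[1])))

    found = []
    for handle in handles:
        key = handle.strip().lower()
        if key in first_row_by_name:
            found.append(first_row_by_name[key])
    return found


# Toy data standing in for usernames.csv and the cleaned-up company list
toy_usernames = io.StringIO("AcmeCorp,1\nsomeuser,2\nglobexhelp,3\n")
toy_handles = ["acmecorp", "GlobexHelp", "missingcompany"]
print(find_company_rows(toy_handles, toy_usernames))
```

Building the dictionary once avoids rescanning the whole usernames file for every handle, which is what the `while read` loop around `egrep` does.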

## 2. Extract usernames that match common Customer Service patterns
Since the input list is not meant to be an exhaustive list of *all* Twitter corporate accounts, there's still a lot of unwanted data that could be cleaned up. Looking at the top 20 nodes by degree shown before, I could easily see that common username patterns are `*support`, `*care` or `*help` (e.g. `stccare`, `xboxsupport`, `amazonhelp` etc.). I could therefore extend my check to the usernames that match these patterns. To do this, I added these statements to the `find_companies_usernames.sh` script:
~~~bash
egrep ".*care,|.*support,|.*help," /path/to/usernames.csv | tee -a ../data/found_companies.csv
sort -u /path/to/found_companies.csv > /path/to/found_companies_nodup.csv # Remove duplicates
~~~
There are **14031** accounts that fall within this category, so I include them in the `found_companies.csv` file.
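The same suffix check can be sketched in Python with the `re` module. The snippet assumes usernames are already lowercase, as they appear in this notebook; the names `SERVICE_SUFFIX` and `matches_service_pattern` are hypothetical.

```python
import re

# Usernames ending in one of the customer service suffixes
SERVICE_SUFFIX = re.compile(r"(care|support|help)$")


def matches_service_pattern(username):
    """Return True if the username ends with care, support or help."""
    return SERVICE_SUFFIX.search(username) is not None


for name in ["stccare", "xboxsupport", "amazonhelp", "helpful", "skincare"]:
    print(name, matches_service_pattern(name))
```

A name such as `skincare` also matches, which illustrates why the pattern alone only yields *potential* corporate accounts.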

Now that I have a list of *14375 potential* corporate accounts, I can decide which ones actually represent noise in our data by checking their degree: if it is significantly high, say $k > 1000 \simeq 100 \cdot \langle k \rangle$, then I conclude that the account should be filtered out.

**Warning**: a simple criterion based on the username pattern and the degree of a node isn't sufficient to guarantee a 100\% match for corporate accounts, although I consider this trade-off pretty reasonable. The main drawback is that a (relatively low) number of false positives get excluded from the analysis; on the other hand, I still won't be able to detect *all* corporate accounts, because of the degree threshold. Lowering the threshold would capture more corporate accounts, but also more false positives.
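One way to get a feeling for this trade-off is to count how many candidates survive at different thresholds. The sketch below uses made-up degrees and a hypothetical helper, `candidates_above`; it is not the filtering code of this notebook.

```python
def candidates_above(node_degrees, candidate_ids, threshold):
    """Keep the (node_id, degree) pairs of candidates whose degree exceeds the threshold."""
    candidate_ids = set(candidate_ids)  # constant-time membership tests
    return [(node, degree) for node, degree in node_degrees
            if node in candidate_ids and degree > threshold]


# Toy data: (node_id, degree) pairs and the IDs flagged by the username checks
toy_degrees = [(1, 31578), (2, 1200), (3, 650), (4, 40), (5, 5000)]
toy_candidates = [1, 2, 3, 4]

for threshold in (1000, 500, 100):
    kept = candidates_above(toy_degrees, toy_candidates, threshold)
    print(threshold, len(kept))
```

Node 5 is never returned despite its high degree, because it was not flagged as a candidate in the first place.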

In [204]:
%%bash --out shell_output
# Get only the IDs to work on in Python
awk -F "," '{print $2}' ../data/found_companies_nodup.csv

In [205]:
companies_ids = [int(el) for el in shell_output.split('\n') if el != '']

In [206]:
high_degrees = sorted([el for el in node_degrees if el[1] > 10*K], key=operator.itemgetter(1), reverse=True)

In [207]:
# Extract subset of node_degrees (a set makes each membership test constant-time)
companies_ids_set = set(companies_ids)
companies_accounts_degrees = [el for el in high_degrees if el[0] in companies_ids_set]

In [208]:
node_ids = [el[0] for el in companies_accounts_degrees]
id_to_username_dict = get_multiple_usernames(node_ids)

# Show data formatted as DataFrame
data = [[id_to_username_dict[str(el[0])], el[0], el[1]] for el in companies_accounts_degrees]
show_data_as_dataframe(data, ["Username", "Node ID (Encoding)", "Degree"])

| | Username | Node ID (Encoding) | Degree |
|---|---|---|---|
| 0 | stccare | 272020 | 31578 |
| 1 | xboxsupport | 112622 | 27853 |
| 2 | americanair | 38831 | 25676 |
| 3 | btcare | 46286 | 19399 |
| 4 | indosatcare | 64396 | 18911 |
| 5 | amazonhelp | 5513254 | 17815 |
| 6 | telkomcare | 735473 | 16169 |
| 7 | tmobilehelp | 247443 | 13218 |
| 8 | vzwsupport | 18919 | 12211 |
| 9 | sprintcare | 33051 | 11278 |


## 3. Run verification checks
Description keywords: customer, customers, support, service, help, ask, team, care, information, helpteam, questions, concerns, inquiries, assistance, assist.

Consider an account a potential company if any of the requirements below is met:
- verified status is TRUE;
- the description contains at least 1 of the description keywords;
- degree > 500 (~ 50 * $\langle k \rangle$).
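The any-of rule above can be written as a small predicate. This is a sketch under assumptions: the parameters `verified` and `description` stand for profile fields whose retrieval is not shown here, and the function name is hypothetical. Only the keyword list and the 500 threshold come from the notes above.

```python
import re

DESCRIPTION_KEYWORDS = {
    "customer", "customers", "support", "service", "help", "ask", "team",
    "care", "information", "helpteam", "questions", "concerns", "inquiries",
    "assistance", "assist",
}


def is_potential_company(verified, description, degree, min_degree=500):
    """Return True if at least one of the three requirements is met."""
    # Split the description into lowercase words before comparing with the keywords
    words = set(re.findall(r"[a-z]+", description.lower()))
    has_keyword = bool(words & DESCRIPTION_KEYWORDS)
    return bool(verified) or has_keyword or degree > min_degree


print(is_potential_company(False, "We are here to help!", 10))
print(is_potential_company(False, "Coffee lover and cyclist", 10))
```

Matching on whole words rather than substrings keeps a description containing, say, "scared" from triggering the `care` keyword.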

*helpdesk
argoshelpers
orangeuk
orangehelpers
twitchhelpers
xboxsupport*
xbox
*assist
vodafone*
pizzahut
xbox*
hondacustsvc
cadillaccustsvc
vauxhallcustsvc
itau
ask*
cocanomc
mtv
ee
o2
*airways
*playstation*
*helps
*porn*
*cares
*movistar*
postnl
yogurstand
eatbulaga
sainsburys
telstra
ask*
mobily
southwestair
united
easyjet
aircanada
optus
optussport
skyresponde
skyhelpteam
asdaserviceteam
captainamerica
kenyapower
delta
deltaassist
tesco
tescomobile
telkomsel
virginmedia
erpestar
tacobell
alphabetsuccess
personalar
onrpe
klm
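
If the entries above are read as shell-style patterns, where `*` stands for any run of characters, they can be matched against usernames with Python's `fnmatch` module. That reading is an assumption, since the notes do not say how the list is applied; the helper name and the sample usernames are hypothetical.

```python
from fnmatch import fnmatchcase


def matches_any(username, patterns):
    """Return True if the username matches at least one glob-style pattern."""
    return any(fnmatchcase(username, pattern) for pattern in patterns)


# A few entries taken from the list above
sample_patterns = ["*helpdesk", "xbox*", "*porn*", "ask*", "klm"]

for name in ["xboxsupport", "ithelpdesk", "askamex", "klm", "klmfans"]:
    print(name, matches_any(name, sample_patterns))
```

Entries without a wildcard, such as `klm`, only match the exact username, so `klmfans` is not caught by this sample.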