In [234]:
import pandas as pd
import emoji
import html

## Data description

In [235]:
df = pd.read_csv('train.csv', encoding='latin-1')

Columns

- `id` - a unique identifier for each tweet
- `keyword` - a particular keyword from the tweet (may be blank, i.e. null)
- `location` - the location the tweet was sent from (may be blank, i.e. null)
- `text` - the text of the tweet
- `target` - in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)


## Data preparation

people use angle brackets `< >` to indicate emotions

=> convert text like `&lt;gasp!&gt;` to `<gasp!>` (could give the model an idea of the sentiment in the text)

some HTML entities were also encoded multiple times, e.g. `House Energy &amp;amp; Commerce` => repeat the unescaping until no line changes anymore

In [236]:
while True:
    converted_text = df["text"].apply(html.unescape)
    cnt = (converted_text != df["text"]).sum()
    print("lines converted: ", cnt)

    if cnt == 0:
        break

    df["text"] = converted_text


lines converted:  359
lines converted:  6
lines converted:  0
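
The loop above can be illustrated on a single string. This is a minimal sketch using only the standard library; the helper name `unescape_fully` and the `max_rounds` safety limit are assumptions made for the illustration and are not part of the notebook.

```python
import html


def unescape_fully(text: str, max_rounds: int = 10) -> str:
    """Apply html.unescape repeatedly until the text stops changing."""
    for _ in range(max_rounds):  # hypothetical safety limit
        unescaped = html.unescape(text)
        if unescaped == text:
            break
        text = unescaped
    return text


# "&amp;amp;" needs two rounds: "&amp;amp;" -> "&amp;" -> "&"
double_encoded = unescape_fully("House Energy &amp;amp; Commerce")
single_encoded = unescape_fully("&lt;gasp!&gt;")
print(double_encoded)
print(single_encoded)
```

This matches the printed counts above: most lines need one pass, a few need a second one, and the third pass changes nothing.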


convert emojis into __text descriptors__ (`:emoji:`)

In [237]:
converted_text = df["text"].apply(emoji.demojize)  # use line 2066 for html example

print("lines converted: ", (converted_text != df["text"]).sum())

df["text"] = converted_text

lines converted:  10


`t.co` is the __URL shortener service__ used by Twitter to shorten all links shared on its platform

the shortened link itself carries no meaningful information, since it is built like `http://t.co/*id*` => replace each link with a token (preserves the context that a link exists)

In [238]:
df["text"] = df["text"].replace(r"http\S+", "<LINK>", regex=True)

replace each mention with a token but __keep the context of a mention__

In [239]:
df["text"] = df["text"].replace(r"@\S+", "<MENTION>", regex=True)
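
To see what the two patterns do on a single tweet, here is a small sketch with the standard `re` module. The sample tweet and the helper name `tokenize_links_and_mentions` are made up for illustration; the patterns and tokens are the ones used in the cells above.

```python
import re

LINK_RE = re.compile(r"http\S+")   # same pattern as in the notebook
MENTION_RE = re.compile(r"@\S+")   # same pattern as in the notebook


def tokenize_links_and_mentions(text: str) -> str:
    """Replace links first, then mentions, with placeholder tokens."""
    text = LINK_RE.sub("<LINK>", text)
    return MENTION_RE.sub("<MENTION>", text)


# hypothetical tweet, not taken from the dataset
sample = "@user1: flood warning http://t.co/abc123 cc @user2"
result = tokenize_links_and_mentions(sample)
print(result)
```

Note that `@\S+` matches up to the next whitespace, so punctuation directly attached to a handle (the `:` after `@user1`) is swallowed into the token. A narrower pattern such as `@\w+` would keep it, at the cost of behaving differently from the notebook.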

remove duplicates after tokenization (`<LINK>` and `<MENTION>` may replace distinct values, but the rest of the text is the same)

In [240]:
# check for duplicates and remove them (keeps the first occurrence of each text)
# note: df.size counts cells (rows x columns), not rows
print("duplicated: ", df["text"].duplicated().sum())
print("df size: ", df.size)
df = df.drop_duplicates(subset=["text"])
print("duplicated: ", df["text"].duplicated().sum())
print("df size: ", df.size)

duplicated:  653
df size:  38065
duplicated:  0
df size:  34800
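
Two details of this step may be worth checking. First, `df.size` is rows times columns: with the 5 columns listed above, 38065 cells correspond to 7613 rows and 34800 cells to 6960 rows, which agrees with the 653 removed duplicates. Second, `drop_duplicates` keeps the first occurrence, so if duplicated texts carry different `target` values, the surviving label depends on row order. The sketch below, on a made-up toy frame, shows one way to list such conflicting texts before dropping them.

```python
import pandas as pd

# toy data, invented for illustration only
toy = pd.DataFrame({
    "text": ["fire <LINK>", "fire <LINK>", "nice day", "nice day"],
    "target": [1, 0, 0, 0],
})

n_cells = toy.size   # rows x columns
n_rows = len(toy)    # rows only

# texts whose duplicates disagree on the label
labels_per_text = toy.groupby("text")["target"].nunique()
conflicting = labels_per_text[labels_per_text > 1].index.tolist()

deduped = toy.drop_duplicates(subset=["text"])  # keeps the first row per text
print(n_cells, n_rows, conflicting, len(deduped))
```

How to resolve conflicting labels (keep the first, drop both, take the majority) is a modelling decision that this sketch leaves open.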


remove multiple spaces and new lines (less noise for classification)

In [241]:
df["text"] = df["text"].replace(r'\s+', ' ', regex=True)
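
A quick check of what `\s+` does on one string, using the standard `re` module. The sample string is invented, and the extra `strip()` in `normalize_whitespace` is an assumption added for comparison; the cell above does not strip.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse whitespace runs and also trim both ends (trimming is an addition)."""
    collapsed = re.sub(r"\s+", " ", text)
    return collapsed.strip()


sample = "  fire\n\nin   the hills \t"

# what the notebook's pattern alone produces
collapsed_only = re.sub(r"\s+", " ", sample)
# with the additional strip()
normalized = normalize_whitespace(sample)

print(repr(collapsed_only))
print(repr(normalized))
```

With the pattern alone, leading and trailing whitespace is reduced to a single space rather than removed. Whether that matters depends on the tokenizer used later.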

some characters from the original text are still corrupted, e.g. `... Taiwan ÛÒ ...` => drop all non-ASCII characters

In [242]:
df["text"] = df["text"].str.encode('ascii', 'ignore').str.decode('ascii')
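
The effect of the encode/decode round trip on a single string can be sketched as follows. The sample sentence is invented (only the `Taiwan ÛÒ` fragment comes from the note above).

```python
# invented sample; only the "Taiwan ÛÒ" fragment appears in the notebook
sample = "Taiwan ÛÒ typhoon near the café"

# errors="ignore" silently drops every character that is not ASCII
ascii_only = sample.encode("ascii", "ignore").decode("ascii")

# removing characters can leave a double space behind
collapsed = " ".join(ascii_only.split())

print(repr(ascii_only))
print(repr(collapsed))
```

Two side effects are visible: legitimate accented letters are dropped as well (`café` becomes `caf`), and a removed token between two spaces leaves a double space, because the whitespace cleanup in the notebook runs before this step.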

replace the URL-encoded space character (`%20`) in the keywords with a real space

In [243]:
df['keyword'] = df['keyword'].str.replace('%20', ' ', regex=False)
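
If other percent-encoded characters turned up in the keywords, a more general option would be `urllib.parse.unquote` from the standard library. This is only a sketch of that alternative, not what the notebook does; the keyword list is illustrative.

```python
from urllib.parse import unquote

# illustrative values; "body bags" is the decoded form seen later in the notebook
keywords = ["body%20bags", "fatalities", "forest%20fire"]

# unquote decodes every %XX escape, not just %20
decoded = [unquote(keyword) for keyword in keywords]
print(decoded)
```

`unquote` expects strings, so null keywords would have to be skipped, for example with `Series.map(unquote, na_action="ignore")` in pandas.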

In [244]:
df["location"].to_csv('output.csv', index=False)

In [245]:
total = len(df)
true_count = (df["target"] == 1).sum()
false_count = (df["target"] == 0).sum()

print("true instances: {} ({})".format(true_count, true_count/total*100))
print("false instances: {} ({})".format(false_count, false_count/total*100))

true instances: 2855 (41.020114942528735)
false instances: 4105 (58.97988505747126)


conclusion => the dataset is reasonably balanced (about 41% disaster vs. 59% non-disaster tweets)
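
As a small worked check of the numbers printed above, the shares can be recomputed from the two counts. The "majority baseline" below is an added reference point, not something the notebook computes: it is the accuracy a model would reach by always predicting the larger class.

```python
# counts taken from the printed output above
true_count = 2855
false_count = 4105

total = true_count + false_count
true_share = true_count / total
false_share = false_count / total

# accuracy of always predicting the majority class (added for reference)
majority_baseline = max(true_share, false_share)

print(f"total: {total}")
print(f"true: {true_share:.1%}, false: {false_share:.1%}")
print(f"majority baseline: {majority_baseline:.1%}")
```

The total of 6960 matches the number of rows left after removing duplicates.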

In [246]:
count_per_keyword = df.groupby("keyword")["id"].count()
print("unique keywords: ", count_per_keyword.size)
print("on average, {:.2f} instances per keyword".format(count_per_keyword.mean()))

unique keywords:  221
on average, 31.25 instances per keyword


In [247]:
df.groupby("keyword")[["target"]].count().sort_values("target", ascending=False)

| keyword | target |
|---|---|
| fatalities | 45 |
| deluge | 42 |
| armageddon | 42 |
| damage | 41 |
| body bags | 41 |
| ... | ... |
| hijacking | 13 |
| epicentre | 12 |
| threat | 11 |
| inundation | 10 |


conclusion => some keywords may hint that a tweet is about a disaster if most of their tweets are true instances (note: the table above counts all tweets per keyword, not only the true instances)
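
To check how strongly a keyword points to a disaster, the share of true instances per keyword can be computed next to the count. The sketch below uses a made-up toy frame; the column names `tweets` and `disaster_rate` are chosen for illustration.

```python
import pandas as pd

# toy data, invented for illustration only
toy = pd.DataFrame({
    "keyword": ["wreckage", "wreckage", "wreckage",
                "deluge", "deluge", "deluge", "deluge"],
    "target": [1, 1, 1, 0, 1, 0, 0],
})

# count = number of tweets, mean of a 0/1 column = share of disaster tweets
keyword_stats = (
    toy.groupby("keyword")["target"]
    .agg(tweets="count", disaster_rate="mean")
    .sort_values("disaster_rate", ascending=False)
)
print(keyword_stats)
```

Because `target` is 0 or 1, its mean per group is the fraction of disaster tweets for that keyword. Keywords with a rate close to 0 or 1 are the informative ones.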

In [248]:
count_per_location = df.groupby("location")["id"].count()
print("unique locations: ", count_per_location.size)
print("on average, {:.2f} instances per location".format(count_per_location.mean()))

unique locations:  3195
on average, 1.46 instances per location


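conclusion => location is not really useful for deciding whether a tweet is about a disaster: with 3195 unique values and only 1.46 tweets per location on average, most locations occur just once or twice (many are invented or free-typed), so the column is left out of the prepared data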
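
The sparsity of the column can be quantified a bit further. The sketch below, on an invented toy frame, computes the share of missing locations and the share of locations that occur exactly once. Note that `groupby` and `value_counts` skip null values by default, so the average above only covers tweets that have a location.

```python
import pandas as pd

# toy data, invented for illustration only
toy = pd.DataFrame({
    "location": ["USA", "USA", "Earth", None, "somewhere else", None],
})

# share of tweets without any location
missing_share = toy["location"].isna().mean()

# value_counts ignores missing values by default
counts = toy["location"].value_counts()
singleton_share = (counts == 1).mean()  # share of locations seen only once

print(f"missing: {missing_share:.2f}, singletons: {singleton_share:.2f}")
```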

In [249]:
df.to_csv("train_prepared.csv", index=False, columns=["id", "keyword", "text", "target"])