# Data Preprocessing Norwegian


### Info about the dataset

#### Resett
We used a subset of 6000 of the comments collected from Resett in fall 2019. All comments annotated as non-neutral earlier in the project thesis were kept, together with a sample of 2000 neutral comments. We subsampled because the topics were imbalanced: most of the posts were written against Muslims. The complete dataset for annotation thus consisted of 41 910 utterances. Table 6.1 shows the amount of data removed in each preprocessing step.
**NB: The data examination below shows that this split is wrong: about 4K neutral and 2K non-neutral comments were kept.**
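
The subsetting described above could be sketched roughly as follows. This is an illustration only, not the code used in the project: the `label` column, the `"neutral"` value, and the helper name `subsample_neutral` are all assumptions, and the number of neutral comments is left as a parameter because the text and the note above disagree on the split.

```python
import pandas as pd

def subsample_neutral(df, n_neutral, label_col="label", neutral="neutral", seed=0):
    """Keep every non-neutral row plus a random sample of n_neutral neutral rows.

    Column name and label value are hypothetical placeholders.
    """
    is_neutral = df[label_col] == neutral
    # Never ask for more neutral rows than actually exist
    n = min(n_neutral, int(is_neutral.sum()))
    neutral_sample = df[is_neutral].sample(n=n, random_state=seed)
    # Restore the original row order after combining the two parts
    return pd.concat([df[~is_neutral], neutral_sample]).sort_index()

# Toy data: four neutral and two non-neutral comments
toy = pd.DataFrame({
    "text": ["a", "b", "c", "d", "e", "f"],
    "label": ["neutral", "hateful", "neutral", "neutral", "offensive", "neutral"],
})
subset = subsample_neutral(toy, n_neutral=2)
```

Fixing `random_state` makes the sample reproducible, which matters if the subset is later compared against the annotated files.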

#### Facebook
* Removed comments longer than 500 characters
* Removed comments shorter than 10 characters
* Replaced names with "Navn"
* Removed non-Norwegian instances and URLs
* Dropped duplicates
* Removed comments only consisting of names
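
The length and duplicate filters from the list above can be sketched in pandas as shown here. This is a minimal illustration under stated assumptions, not the original filtering code: the helper name `filter_comments` is hypothetical, and treating both length bounds as inclusive is a guess. Name replacement and language filtering are left out because the source does not describe how they were done.

```python
import pandas as pd

def filter_comments(df, min_len=10, max_len=500):
    """Drop comments outside [min_len, max_len] characters, then drop duplicate texts."""
    lengths = df["text"].str.len()
    kept = df[(lengths >= min_len) & (lengths <= max_len)]
    return kept.drop_duplicates(subset="text").reset_index(drop=True)

toy = pd.DataFrame({
    "text": [
        "kort",                                 # too short (4 characters)
        "Dette er en helt vanlig kommentar.",   # kept
        "Dette er en helt vanlig kommentar.",   # exact duplicate, dropped
        "x" * 501,                              # too long
    ]
})
filtered = filter_comments(toy)
```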


#### Twitter
* Removed non-Norwegian instances
* Removed URLs and via-mentions
* Dropped duplicates
* Replaced usernames with @USER
* Removed # at the beginning of sentences
* Removed tweets directly from newspaper accounts (DN+ etc.)
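
Some of the Twitter steps above lend themselves to simple regular expressions. The sketch below is an assumption about how they might look, not the patterns actually used: `clean_tweet` and the three regexes are hypothetical, the `#` is only stripped at the start of the tweet rather than at every sentence start, and via-mentions and newspaper accounts are not handled because their exact form is not given in the source.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # http(s) links and bare www links
USER_RE = re.compile(r"@\w+")                    # any @handle
LEADING_HASH_RE = re.compile(r"^\s*#")           # '#' at the very start of the tweet

def clean_tweet(text):
    text = URL_RE.sub("", text)
    text = USER_RE.sub("@USER", text)
    text = LEADING_HASH_RE.sub("", text)
    # Collapse the whitespace left behind by the removals
    return re.sub(r"\s+", " ", text).strip()

example = clean_tweet("#Valg @ola_nordmann se her https://example.com/a?b=1")
```

Replacing handles before any lowercasing keeps the `@USER` placeholder in the form that `convert_user` later expects.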



## Libraries

In [None]:
import pkg_resources
import subprocess
import sys
import os
import pandas as pd
from pandas_ods_reader import read_ods

import seaborn as sns
import matplotlib.pyplot as plt 

import nltk
import re, itertools
from string import punctuation
#from spellchecker import SpellChecker

import emoji

nltk.download('punkt')

import warnings
warnings.filterwarnings("ignore")

## Import data

In [None]:
path = os.getcwd()+ "/SvanesEA20/"

fb = pd.read_csv(path + "facebook_labelled.csv")
tw = pd.read_csv(path + "tweets_labelled.csv")
rs = pd.read_csv(path + "resett_labelled.csv")

names = ["Facebook", "Twitter", "Resett"]
dfs = [fb, tw, rs]

In [None]:
for i in range(len(dfs)):
    print(names[i] + ": ")
    display(dfs[i].head(2))
    print()

## Examine data

#### Maximum and minimum lengths of texts

In [None]:
print("------------")
print("Maximum lengths: ")
print("------------")
for i in range(len(dfs)):
    print()
    print(names[i] + ": ", dfs[i].text.map(lambda x: len(x)).max())

print()

print("------------")
print("Minimum lengths: ")
print("------------")
for i in range(len(dfs)):
    print()
    print(names[i] + ": ", dfs[i].text.map(lambda x: len(x)).min())

In [None]:
for i in range(len(dfs)):
    df = dfs[i]
    
    print(names[i] + ": ")
    #display(dfs[i].category.value_counts())
    
    df = df.loc[df['category'].isin([str(x) for x in range(1,6)])]
    display(df.category.value_counts())
    
    print("Length dataframe:", len(df))
    print()
 
    

In [None]:
length = sum([len(df) for df in dfs])
print(length)

In [None]:
11366 + 23784 + 5995

In [None]:
for df in dfs:
    print(df.text.map(len))

In [None]:
dfs = [fb, tw, rs]

In [None]:
for df in dfs:
    df["length"] = df.text.map(lambda x: len(x))


In [None]:
for df in dfs: 
    df = df.loc[df['category'].isin([str(x) for x in range(1,6)])]
    print(len(df[df.length < 10]))
    

# Preprocessing 

## Functions

In [None]:
# Lowercase, remove irrelevant punctuation
def clean_text(string):
    return ''.join(c for c in string.lower() if c not in punctuation)

# Reduce orthographic lengthening to two characters
def remove_duplicates(string):
    cleaned = ''.join(''.join(s)[:2] for _, s in itertools.groupby(string))
    return cleaned

def sent_tokenize(string):
    return nltk.sent_tokenize(string, language = "norwegian")

def word_tokenize(string):
    return nltk.word_tokenize(string, language = "norwegian")

def convert_user(string):
    return re.sub(r'(@User|@USER)', 'Navn', string)
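
To make the effect of the two cleaning functions concrete, the sketch below repeats their definitions so that it runs on its own and applies them to a made-up example string. The input string is an assumption chosen for illustration.

```python
import itertools
from string import punctuation

def clean_text(string):
    # Lowercase and drop every ASCII punctuation character
    return ''.join(c for c in string.lower() if c not in punctuation)

def remove_duplicates(string):
    # groupby yields runs of identical characters; keep at most two per run
    return ''.join(''.join(run)[:2] for _, run in itertools.groupby(string))

raw = "Neiiiii!!! Såååå braaa"
step1 = clean_text(raw)          # "neiiiii såååå braaa"
step2 = remove_duplicates(step1) # "neii såå braa"

# Caveat: runs of digits are shortened as well
digits = remove_duplicates("1000")  # "100"
```

Keeping two characters rather than one preserves legitimate double letters such as in "takk", at the cost of leaving some lengthened words slightly misspelled. The digit example shows that numbers can change value, which may matter if numeric tokens are relevant downstream.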

In [None]:
df = fb.text.apply(word_tokenize)

In [None]:
df.head()

In [None]:
length = df.apply(len)

In [None]:
length[length > 128]

## Clean text 

In [None]:
for df in dfs:
    df["cleaned_text"] = df.text.apply(clean_text).apply(remove_duplicates)

In [None]:
fb.head()

## Convert @USER to Navn

In [None]:
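# `no` is not defined earlier in this notebook; assumed here to be the three
# sources stacked into one frame with a `source` column ("fb", "tw", "rs").
no = pd.concat([df.assign(source=s) for df, s in zip(dfs, ["fb", "tw", "rs"])], ignore_index=True)
no["text"] = no["text"].apply(convert_user)
no["cleaned_text"] = no["cleaned_text"].apply(convert_user)

no[no.source == "tw"].head(3)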

## Save dataframe to csv

In [None]:
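# Save the combined frame `no` (assumed to be the intended output);
# index=False avoids an extra "Unnamed: 0" column when reading the file back.
no.to_csv(path + 'preprocessed.csv', index=False)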

In [None]:
df = pd.read_csv(path + 'preprocessed.csv')

In [None]:
df.head(3)