# 1. Dataset Cleansing 

***

This notebook contains the general data cleansing scripts that we applied to the datasets used to develop the automatic fake news detection model.


### 1.1 Script to extract attributes from the JSON news content format into a dataframe

This script extracts multiple attributes of the news content from the JSON file, extracts the domain name from the `url` attribute, and creates a new dataframe.

The script was applied to the FakeNewsNet dataset.
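
A minimal sketch of this step, not the exact script that was run. It assumes each news content record is a JSON object with `url`, `title` and `text` keys; the sample records, the helper names `extract_domain` and `record_to_row`, and the `news_df` variable are hypothetical.

```python
import json
from urllib.parse import urlparse

import pandas as pd

# Hypothetical sample records standing in for the JSON news content files.
raw_records = [
    '{"url": "https://www.example.com/politics/story-1", '
    '"title": "Story one", "text": "Body of story one."}',
    '{"url": "http://news.example.org/a/b", '
    '"title": "Story two", "text": "Body of story two."}',
]


def extract_domain(url):
    """Return the host part of a URL without a leading 'www.'.

    URLs without a scheme (no 'http://') yield an empty string here.
    """
    netloc = urlparse(url).netloc.lower()
    return netloc[4:] if netloc.startswith("www.") else netloc


def record_to_row(raw_json):
    """Pick the attributes of interest from one JSON news record."""
    content = json.loads(raw_json)
    url = content.get("url", "")
    return {
        "title": content.get("title"),
        "text": content.get("text"),
        "url": url,
        "domain": extract_domain(url),
    }


# One dataframe row per news record.
news_df = pd.DataFrame([record_to_row(r) for r in raw_records])
print(news_df[["title", "domain"]])
```

Using `dict.get` keeps the sketch tolerant of records that lack one of the assumed keys; such rows end up with missing values that a later cleansing step can remove.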


***
### 1.2 Removing rows having missing values

We checked the news article text for missing values and removed the rows where the text was missing.

**Sample Script**
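
A minimal sketch of this step, assuming a pandas dataframe `news_df` with a hypothetical `text` column that holds the article body. Dropping whitespace-only texts is an extra assumption on top of the missing-value check described above.

```python
import pandas as pd

# Hypothetical sample data; the second row has no text, the third only blanks.
news_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "text": ["Some article text.", None, "   ", "Another article."],
})

# Count missing values per column before removing anything.
print(news_df.isnull().sum())

# Drop rows whose text is None/NaN.
cleaned_df = news_df.dropna(subset=["text"])

# Also drop rows whose text is empty or whitespace only (extra assumption).
cleaned_df = cleaned_df[cleaned_df["text"].str.strip() != ""]
cleaned_df = cleaned_df.reset_index(drop=True)

print(cleaned_df)
```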


***
### 1.3 Removing duplicate rows

We removed rows with duplicate news titles and texts from the dataset.

**Sample Script**
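
A minimal sketch of this step, assuming a pandas dataframe `news_df` with hypothetical `title` and `text` columns. A row counts as a duplicate here only when both columns match an earlier row; whether the original cleansing matched on both columns together or on each separately is not stated.

```python
import pandas as pd

# Hypothetical sample data; the second row repeats the first one exactly.
news_df = pd.DataFrame({
    "title": ["Story one", "Story one", "Story two", "Story three"],
    "text": ["Same body.", "Same body.", "Other body.", "Same body."],
})

# Count rows that repeat an earlier (title, text) pair.
n_duplicates = int(news_df.duplicated(subset=["title", "text"]).sum())
print("duplicate rows:", n_duplicates)

# Keep the first occurrence of each (title, text) pair.
deduped_df = (
    news_df.drop_duplicates(subset=["title", "text"], keep="first")
    .reset_index(drop=True)
)
print(deduped_df)
```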

***

# 2. Text Data Preprocessing 


### Preprocessing with Cleaning Process


Text preprocessing is required when building a text-based classification model. Text data usually arrives as unstructured raw data that cannot be fed directly into a model, so we first cleanse it and bring it into a consistent form.

We experimented with the following preprocessing methods in our classification work:

- Removal of any HTML content
- Removal of URLs and numbers
- Removal of all kinds of date formats
- Removal of punctuation
- Conversion to lower case
- Replacement of 2 or more consecutive whitespaces with a single one
- Removal of stopwords
- Lemmatization
- Stemming
- POS tagging
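
The regex-based steps in the list above (URLs, dates, numbers, punctuation, lower case, whitespace) can be sketched with the standard library alone. This is an illustrative sketch, not the full pipeline: the function name `regex_clean` and the sample sentence are hypothetical. Dates are removed before bare numbers so that the date pattern can still match.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
DATE_RE = re.compile(r"\d+[./-]\d+[./-]\d+")  # e.g. 12/05/2020, 2020-05-12


def regex_clean(text):
    """Apply the regex-only cleaning steps and return a normalized string."""
    text = URL_RE.sub(" ", text)            # URLs first, while '://' is intact
    text = DATE_RE.sub(" ", text)           # dates before bare numbers
    text = re.sub(r"\d+", " ", text)        # any remaining numbers
    text = re.sub(r"[^a-zA-Z]", " ", text)  # punctuation and other non-letters
    text = text.lower()
    return re.sub(r"\s{2,}", " ", text).strip()  # collapse repeated whitespace


sample = "Visit https://example.com/news on 12/05/2020: 3 NEW reports,  read now!"
print(regex_clean(sample))  # visit on new reports read now
```

Replacing matches with a space rather than an empty string avoids gluing neighbouring words together; the final whitespace step cleans up the extra spaces.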


#### Helper functions to preprocess the text with the operations listed above

In [None]:
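import re

import nltk
from bs4 import BeautifulSoup
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Map the NLTK POS tag of a single word to the WordNet POS constant
# expected by the lemmatizer (falls back to noun)
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)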




def text_preprocess(text):

    # Remove HTML tags
    bsoup = BeautifulSoup(text, "html.parser")
    clean_text = bsoup.get_text()
    
    
    # Remove any URL (sub returns a new string, so the result must be kept)
    url = re.compile(r'https?://\S+|www\.\S+')
    clean_text = url.sub(r'', clean_text)
    
    # Remove all kinds of date formats (before bare numbers; once the digits
    # are gone this pattern can no longer match)
    clean_text=re.sub(r'\d+[\.\/-]\d+[\.\/-]\d+', '', clean_text)
    
    # Remove any remaining numbers
    clean_text=re.sub(r'\d+','',clean_text)
    
    
    # Removal of bracketed text and punctuation, then lower case conversion
    clean_text = re.sub(r'\[[^]]*\]', ' ', clean_text)
    clean_text = re.sub(r'[^a-zA-Z]',' ',clean_text)  # replaces non-alphabetic characters with spaces
    clean_text = clean_text.lower()
    
    # Replace 2 or more consecutive spaces with a single one
    clean_text=re.sub(r' {2,}',' ',clean_text)
    
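    # Removal of stop words and of tokens with 2 characters or fewer
    # (total_stop_words_list is assumed to be defined earlier in the notebook)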
    word_tokens = word_tokenize(clean_text) 
    #stop_words = set(stopwords.words('english'))

    newcleantext = [w.strip() for w in word_tokens if w not in total_stop_words_list and len(w) > 2] 
    
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    newcleantext= [lemmatizer.lemmatize(w,get_wordnet_pos(w)) for w in newcleantext]
    
    # Apply stop word removal again, since lemmatization can map a token onto a stop word
     
    newcleantext = [w.strip() for w in newcleantext if w not in total_stop_words_list and len(w) > 2] 
    
    # Remove duplicate tokens while keeping first-occurrence order
    newcleantext = list(dict.fromkeys(newcleantext))
    
    return newcleantext