# EDA on datasets - PART 1

### Key points from the readme
1. The objective of the phrase-level annotation task was to classify each example sentence into a positive, negative or neutral category by considering only the information explicitly available in the given sentence.

2. Annotators were asked to consider the sentences from the viewpoint of an investor only, i.e. whether the news may have a positive, negative or neutral influence on the stock price. As a result, sentences whose sentiment is not relevant from an economic or financial perspective are considered neutral.

3. The dataset covers a collection of 4840 sentences.

4. Given the large number of overlapping annotations (5 to 8 annotations per sentence), there are several ways to define a majority-vote-based gold standard.

5. Four files are provided, based on the strength of majority agreement:

    (i) sentences with 100% agreement [file=Sentences_AllAgree.txt]; 
    
    (ii) sentences with more than 75% agreement [file=Sentences_75Agree.txt]; 
    
    (iii) sentences with more than 66% agreement [file=Sentences_66Agree.txt]; and 
    
    (iv) sentences with more than 50% agreement [file=Sentences_50Agree.txt]
    
6. Each line has the format `sentence@sentiment`
        E.g.,  The operating margin came down to 2.4 % from 5.7 % .@negative


7. The sentiment is one of "positive", "neutral" or "negative".
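
The agreement thresholds in point 5 can be illustrated with a small sketch. It assumes the thresholds are strict ("more than") and that the files are nested, so a sentence with full agreement also appears in the lower-threshold files. The helper names `majority_agreement` and `files_for` and the sample annotations are hypothetical, not part of the dataset tooling.

```python
from collections import Counter

# File names from the readme, keyed by the agreement share that must be exceeded.
THRESHOLD_FILES = [
    (0.50, "Sentences_50Agree.txt"),
    (0.66, "Sentences_66Agree.txt"),
    (0.75, "Sentences_75Agree.txt"),
]


def majority_agreement(labels):
    """Return (majority_label, share of annotators who chose it)."""
    label, votes = Counter(labels).most_common(1)[0]
    return label, votes / len(labels)


def files_for(labels):
    """List the files a sentence would fall into (assumed strict '>' thresholds)."""
    _, share = majority_agreement(labels)
    files = [name for threshold, name in THRESHOLD_FILES if share > threshold]
    if share == 1.0:
        files.append("Sentences_AllAgree.txt")
    return files


# Five hypothetical annotations for one sentence: 4 of 5 agree (80%).
print(files_for(["negative", "negative", "negative", "negative", "neutral"]))
# ['Sentences_50Agree.txt', 'Sentences_66Agree.txt', 'Sentences_75Agree.txt']
```

Under these assumptions, lowering the threshold can only add sentences, which matches the growing row counts of the four files examined below.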


### What are we doing here?

1. Clean and preprocess the datasets
2. Create a CSV file for each dataset

In [75]:
import pandas as pd
from collections import Counter
import os

#### 1. Clean and preprocess datasets

In [76]:
def clean_data(path):
    """
    Read a `sentence@sentiment` text file into a DataFrame.

    Adds an "agreement" column whose value ("All", "75", "66" or "50")
    is taken from the file name, e.g. Sentences_75Agree.txt -> "75".

    Returns the DataFrame with columns sentence, sentiment and agreement.
    """
    # 1. Read the text file line by line.
    # 2. Split each line on the last "@" into sentence and sentiment.
    # 3. Append the pair as a row with the columns "sentence" and "sentiment".
    # 4. Derive the agreement level from the file name and add it as a column.
    # 5. Return the DataFrame.

    df = pd.DataFrame(columns=["sentence", "sentiment"])

    with open(path, "r") as text_file:
        for line in text_file:
            # rsplit with maxsplit=1 keeps any "@" inside the sentence intact
            sentence, sentiment = line.rsplit("@", 1)
            df.loc[len(df.index)] = [sentence.strip(), sentiment.strip()]

    # Use only the file name so underscores in directory names do not matter.
    file_name = os.path.basename(path)
    _, agreement = file_name.split("_")
    agreement_percent, _ = agreement.split("Agree")
    df["agreement"] = agreement_percent.strip()
    return df
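
The line format `sentence@sentiment` can also be parsed on its own, which makes the splitting step easy to check in isolation. The following is a minimal sketch with a hypothetical helper name, `parse_line`. It splits on the last `@`, so a sentence that itself contains `@` would not break the two-value unpacking. Whether the dataset contains such sentences is not established here, and the second sample line is invented.

```python
def parse_line(line):
    """Split one 'sentence@sentiment' line into its two stripped parts."""
    # rsplit with maxsplit=1 cuts at the LAST '@' only.
    sentence, sentiment = line.rsplit("@", 1)
    return sentence.strip(), sentiment.strip()


# Example line taken from the readme.
print(parse_line("The operating margin came down to 2.4 % from 5.7 % .@negative\n"))

# Hypothetical line with an extra '@' inside the sentence.
print(parse_line("Contact us at info@example.com for details .@neutral\n"))
```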

        
    

##### Sentences_AllAgree.txt

###### Findings
1. There were 5 duplicate sentences; the duplicates were removed.
2. Shape before duplicate removal: (2264, 3)
3. Shape after duplicate removal: (2259, 3)
4. neutral : 1386
5. positive: 570
6. negative: 303

Saved dataframe  : data/Sentences_AllAgree.txt.csv

In [77]:
path="data/Sentences_AllAgree.txt"
df =clean_data(path=path)

In [78]:
df.size , len(df), df.shape

(6792, 2264, (2264, 3))

In [79]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2264 entries, 0 to 2263
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   sentence   2264 non-null   object
 1   sentiment  2264 non-null   object
 2   agreement  2264 non-null   object
dtypes: object(3)
memory usage: 70.8+ KB


In [80]:
df.describe()


| | sentence | sentiment | agreement |
|---|---|---|---|
| count | 2264 | 2264 | 2264 |
| unique | 2259 | 3 | 1 |
| top | SSH Communications Security Corporation is hea... | neutral | All |
| freq | 2 | 1391 | 2264 |


In [81]:
df["sentence"].duplicated().sum()

np.int64(5)

In [82]:
duplicated_ids = df[df['sentence'].duplicated(keep=False)]

print(duplicated_ids)

                                               sentence sentiment agreement
518   The issuer is solely responsible for the conte...   neutral       All
519   The issuer is solely responsible for the conte...   neutral       All
625   The report profiles 614 companies including ma...   neutral       All
626   The report profiles 614 companies including ma...   neutral       All
928   Ahlstrom 's share is quoted on the NASDAQ OMX ...   neutral       All
929   Ahlstrom 's share is quoted on the NASDAQ OMX ...   neutral       All
1026  SSH Communications Security Corporation is hea...   neutral       All
1027  SSH Communications Security Corporation is hea...   neutral       All
1408  The company serves customers in various indust...   neutral       All
1409  The company serves customers in various indust...   neutral       All
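
The count of 5 and the 10 rows printed above come from the `keep` argument of `duplicated`. A small sketch on toy data (the values are hypothetical) shows the difference:

```python
import pandas as pd

# Toy data: sentence "a" appears twice, "b" three times, "c" once.
toy = pd.DataFrame({"sentence": ["a", "a", "b", "b", "b", "c"]})

# Default keep="first": the first occurrence of each sentence is NOT flagged,
# so the sum counts only the extra copies.
extra_copies = toy["sentence"].duplicated().sum()

# keep=False: every row that belongs to a duplicated group is flagged.
rows_in_groups = toy["sentence"].duplicated(keep=False).sum()

print(extra_copies, rows_in_groups)  # 3 5
```

With five sentences that each occur exactly twice, the default therefore reports 5 while `keep=False` selects all 10 rows.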


In [83]:
df.drop_duplicates(inplace=True)

In [84]:
df["sentence"].duplicated().sum()

np.int64(0)

In [85]:
df.duplicated().sum()

np.int64(0)

In [86]:
df.shape

(2259, 3)

In [87]:
df.isna().sum()

sentence     0
sentiment    0
agreement    0
dtype: int64

In [88]:
df.describe()

| | sentence | sentiment | agreement |
|---|---|---|---|
| count | 2259 | 2259 | 2259 |
| unique | 2259 | 3 | 1 |
| top | Sales in Finland decreased by 10.5 % in Januar... | neutral | All |
| freq | 1 | 1386 | 2259 |


In [89]:
Counter(df["sentiment"])

Counter({'neutral': 1386, 'positive': 570, 'negative': 303})

In [90]:
csv_path=path+".csv"
if os.path.isfile(csv_path):
    print(f"File  exists : {csv_path} !!!")
else:
    print(f"created file : {csv_path}")
    df.to_csv(csv_path,index=False)

File  exists : data/Sentences_AllAgree.txt.csv !!!
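
One detail to keep in mind when these CSV files are read back: `clean_data` stores the agreement level as a string, but `pd.read_csv` infers column types, so a column holding only `"75"` comes back as integers while `"All"` stays text. The sketch below uses a temporary file and a made-up row to show this and one possible way to keep the column as text. It is an illustration, not a step from this notebook.

```python
import os
import tempfile

import pandas as pd

sample = pd.DataFrame({
    "sentence": ["Sample sentence ."],  # hypothetical row
    "sentiment": ["neutral"],
    "agreement": ["75"],                # stored as a string, as clean_data does
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "sample.csv")
    sample.to_csv(csv_path, index=False)

    inferred = pd.read_csv(csv_path)                              # type inference
    as_text = pd.read_csv(csv_path, dtype={"agreement": str})     # forced to text

print(inferred["agreement"].dtype)         # an integer dtype
print(repr(as_text["agreement"].iloc[0]))  # '75'
```

Passing `dtype` explicitly keeps the column comparable across the four files if they are later combined.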


##### Sentences_75Agree.txt

###### Findings
1. There were 5 duplicate sentences; the duplicates were removed.
2. Shape before duplicate removal: (3453, 3)
3. Shape after duplicate removal: (3448, 3)
4. neutral : 2141
5. positive: 887
6. negative: 420


Saved dataframe  : data/Sentences_75Agree.txt.csv

In [91]:
path="data/Sentences_75Agree.txt"
df =clean_data(path=path)

In [92]:
df.size , len(df), df.shape

(10359, 3453, (3453, 3))

In [93]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3453 entries, 0 to 3452
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   sentence   3453 non-null   object
 1   sentiment  3453 non-null   object
 2   agreement  3453 non-null   object
dtypes: object(3)
memory usage: 107.9+ KB


In [94]:
df.describe()


| | sentence | sentiment | agreement |
|---|---|---|---|
| count | 3453 | 3453 | 3453 |
| unique | 3448 | 3 | 1 |
| top | SSH Communications Security Corporation is hea... | neutral | 75 |
| freq | 2 | 2146 | 3453 |


In [95]:
df["sentence"].duplicated().sum()

np.int64(5)

In [96]:
duplicated_ids = df[df['sentence'].duplicated(keep=False)]

print(duplicated_ids)

                                               sentence sentiment agreement
773   The issuer is solely responsible for the conte...   neutral        75
774   The issuer is solely responsible for the conte...   neutral        75
958   The report profiles 614 companies including ma...   neutral        75
959   The report profiles 614 companies including ma...   neutral        75
1544  Ahlstrom 's share is quoted on the NASDAQ OMX ...   neutral        75
1545  Ahlstrom 's share is quoted on the NASDAQ OMX ...   neutral        75
1681  SSH Communications Security Corporation is hea...   neutral        75
1682  SSH Communications Security Corporation is hea...   neutral        75
2195  The company serves customers in various indust...   neutral        75
2196  The company serves customers in various indust...   neutral        75


In [97]:
df.drop_duplicates(inplace=True)

In [98]:
df["sentence"].duplicated().sum()

np.int64(0)

In [99]:
df.duplicated().sum()

np.int64(0)

In [100]:
df.shape

(3448, 3)

In [101]:
df.isna().sum()

sentence     0
sentiment    0
agreement    0
dtype: int64

In [102]:
df.describe()

| | sentence | sentiment | agreement |
|---|---|---|---|
| count | 3448 | 3448 | 3448 |
| unique | 3448 | 3 | 1 |
| top | Sales in Finland decreased by 10.5 % in Januar... | neutral | 75 |
| freq | 1 | 2141 | 3448 |


In [103]:
Counter(df["sentiment"])

Counter({'neutral': 2141, 'positive': 887, 'negative': 420})

In [104]:
csv_path=path+".csv"
if os.path.isfile(csv_path):
    print(f"File  exists : {csv_path} !!!")
else:
    print(f"created file : {csv_path}")
    df.to_csv(csv_path,index=False)

File  exists : data/Sentences_75Agree.txt.csv !!!


##### Sentences_66Agree.txt

###### Findings
1. There were 6 duplicate sentences; the duplicates were removed.
2. Shape before duplicate removal: (4217, 3)
3. Shape after duplicate removal: (4211, 3)
4. neutral : 2529
5. positive: 1168
6. negative: 514


Saved dataframe  : data/Sentences_66Agree.txt.csv

In [105]:
path="data/Sentences_66Agree.txt"
df =clean_data(path=path)

In [106]:
df.size , len(df), df.shape

(12651, 4217, (4217, 3))

In [107]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4217 entries, 0 to 4216
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   sentence   4217 non-null   object
 1   sentiment  4217 non-null   object
 2   agreement  4217 non-null   object
dtypes: object(3)
memory usage: 131.8+ KB


In [108]:
df.describe()


| | sentence | sentiment | agreement |
|---|---|---|---|
| count | 4217 | 4217 | 4217 |
| unique | 4211 | 3 | 1 |
| top | The report profiles 614 companies including ma... | neutral | 66 |
| freq | 2 | 2535 | 4217 |


In [109]:
df["sentence"].duplicated().sum()

np.int64(6)

In [110]:
duplicated_ids = df[df['sentence'].duplicated(keep=False)]

print(duplicated_ids)

                                               sentence sentiment agreement
959   The issuer is solely responsible for the conte...   neutral        66
960   The issuer is solely responsible for the conte...   neutral        66
1195  The report profiles 614 companies including ma...   neutral        66
1196  The report profiles 614 companies including ma...   neutral        66
2003  Ahlstrom 's share is quoted on the NASDAQ OMX ...   neutral        66
2004  Ahlstrom 's share is quoted on the NASDAQ OMX ...   neutral        66
2164  SSH Communications Security Corporation is hea...   neutral        66
2165  SSH Communications Security Corporation is hea...   neutral        66
2641  Proha Plc ( Euronext :7327 ) announced today (...   neutral        66
2642  Proha Plc ( Euronext :7327 ) announced today (...   neutral        66
2745  The company serves customers in various indust...   neutral        66
2746  The company serves customers in various indust...   neutral        66


In [111]:
df.drop_duplicates(inplace=True)

In [112]:
df["sentence"].duplicated().sum()

np.int64(0)

In [113]:
df.duplicated().sum()

np.int64(0)

In [114]:
df.shape

(4211, 3)

In [115]:
df.isna().sum()

sentence     0
sentiment    0
agreement    0
dtype: int64

In [116]:
df.describe()

| | sentence | sentiment | agreement |
|---|---|---|---|
| count | 4211 | 4211 | 4211 |
| unique | 4211 | 3 | 1 |
| top | Sales in Finland decreased by 10.5 % in Januar... | neutral | 66 |
| freq | 1 | 2529 | 4211 |


In [117]:
Counter(df["sentiment"])

Counter({'neutral': 2529, 'positive': 1168, 'negative': 514})

In [118]:
csv_path=path+".csv"
if os.path.isfile(csv_path):
    print(f"File  exists : {csv_path} !!!")
else:
    print(f"created file : {csv_path}")
    df.to_csv(csv_path,index=False)

File  exists : data/Sentences_66Agree.txt.csv !!!


##### Sentences_50Agree.txt

###### Findings
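1. There were 8 duplicate sentences. The 6 exact duplicate rows were removed; the other 2 sentences still appear twice because their copies carry different sentiment labels (neutral and positive).
2. Shape before duplicate removal: (4846, 3)
3. Shape after duplicate removal: (4840, 3)
4. neutral : 2873
5. positive: 1363
6. negative: 604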


Saved dataframe  : data/Sentences_50Agree.txt.csv

In [119]:
path="data/Sentences_50Agree.txt"
df =clean_data(path=path)

In [120]:
df.size , len(df), df.shape

(14538, 4846, (4846, 3))

In [121]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4846 entries, 0 to 4845
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   sentence   4846 non-null   object
 1   sentiment  4846 non-null   object
 2   agreement  4846 non-null   object
dtypes: object(3)
memory usage: 151.4+ KB


In [122]:
df.describe()


| | sentence | sentiment | agreement |
|---|---|---|---|
| count | 4846 | 4846 | 4846 |
| unique | 4838 | 3 | 1 |
| top | Ahlstrom 's share is quoted on the NASDAQ OMX ... | neutral | 50 |
| freq | 2 | 2879 | 4846 |


In [123]:
df["sentence"].duplicated().sum()

np.int64(8)

In [124]:
duplicated_ids = df[df['sentence'].duplicated(keep=False)]

print(duplicated_ids)

                                               sentence sentiment agreement
78    TELECOMWORLDWIRE-7 April 2006-TJ Group Plc sel...   neutral        50
79    TELECOMWORLDWIRE-7 April 2006-TJ Group Plc sel...  positive        50
788   The Group 's business is balanced by its broad...  positive        50
789   The Group 's business is balanced by its broad...   neutral        50
1098  The issuer is solely responsible for the conte...   neutral        50
1099  The issuer is solely responsible for the conte...   neutral        50
1415  The report profiles 614 companies including ma...   neutral        50
1416  The report profiles 614 companies including ma...   neutral        50
2395  Ahlstrom 's share is quoted on the NASDAQ OMX ...   neutral        50
2396  Ahlstrom 's share is quoted on the NASDAQ OMX ...   neutral        50
2566  SSH Communications Security Corporation is hea...   neutral        50
2567  SSH Communications Security Corporation is hea...   neutral        50
3093  Proha 

In [125]:
df.drop_duplicates(inplace=True)

In [126]:
df["sentence"].duplicated().sum()

np.int64(2)

In [127]:
df.duplicated().sum()

np.int64(0)
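
The two checks above disagree because `drop_duplicates()` compares whole rows: a sentence that appears twice with different sentiment labels is not an exact duplicate and survives. A minimal sketch on toy data (hypothetical rows) for listing such sentences follows. The final filtering step is only one possible policy and is not applied in this notebook.

```python
import pandas as pd

# Toy frame mimicking the situation: s1 has conflicting labels, s2 is an exact copy.
toy = pd.DataFrame({
    "sentence": ["s1", "s1", "s2", "s2", "s3"],
    "sentiment": ["neutral", "positive", "neutral", "neutral", "negative"],
})

toy = toy.drop_duplicates()  # removes only the exact copy of s2

# Sentences that still appear with more than one distinct label.
labels_per_sentence = toy.groupby("sentence")["sentiment"].nunique()
conflicting = labels_per_sentence[labels_per_sentence > 1].index.tolist()
print(conflicting)  # ['s1']

# One possible policy (an assumption, not the notebook's choice): drop them entirely.
resolved = toy[~toy["sentence"].isin(conflicting)]
print(len(resolved))  # 2
```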

In [128]:
df.shape

(4840, 3)

In [129]:
df.isna().sum()

sentence     0
sentiment    0
agreement    0
dtype: int64

In [130]:
df.describe()

| | sentence | sentiment | agreement |
|---|---|---|---|
| count | 4840 | 4840 | 4840 |
| unique | 4838 | 3 | 1 |
| top | TELECOMWORLDWIRE-7 April 2006-TJ Group Plc sel... | neutral | 50 |
| freq | 2 | 2873 | 4840 |


In [131]:
Counter(df["sentiment"])

Counter({'neutral': 2873, 'positive': 1363, 'negative': 604})

In [132]:
csv_path=path+".csv"
if os.path.isfile(csv_path):
    print(f"File  exists : {csv_path} !!!")
else:
    print(f"created file : {csv_path}")
    df.to_csv(csv_path,index=False)

File  exists : data/Sentences_50Agree.txt.csv !!!


### Conclusion


##### Sentences_AllAgree.txt

###### Findings
1. There were 5 duplicate sentences; the duplicates were removed.
2. Shape before duplicate removal: (2264, 3)
3. Shape after duplicate removal: (2259, 3)
4. neutral : 1386
5. positive: 570
6. negative: 303


##### Sentences_75Agree.txt

###### Findings
1. There were 5 duplicate sentences; the duplicates were removed.
2. Shape before duplicate removal: (3453, 3)
3. Shape after duplicate removal: (3448, 3)
4. neutral : 2141
5. positive: 887
6. negative: 420


##### Sentences_66Agree.txt

###### Findings
1. There were 6 duplicate sentences; the duplicates were removed.
2. Shape before duplicate removal: (4217, 3)
3. Shape after duplicate removal: (4211, 3)
4. neutral : 2529
5. positive: 1168
6. negative: 514


##### Sentences_50Agree.txt

###### Findings
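1. There were 8 duplicate sentences. The 6 exact duplicate rows were removed; the other 2 sentences still appear twice because their copies carry different sentiment labels (neutral and positive).
2. Shape before duplicate removal: (4846, 3)
3. Shape after duplicate removal: (4840, 3)
4. neutral : 2873
5. positive: 1363
6. negative: 604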
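
The label counts above show that neutral is the largest class in all four files. To compare the class balance across files, the counts can be turned into shares. The sketch below copies the counts from the findings; the helper name `class_shares` is hypothetical.

```python
# Counts copied from the findings above (after duplicate removal).
counts = {
    "AllAgree": {"neutral": 1386, "positive": 570, "negative": 303},
    "75Agree": {"neutral": 2141, "positive": 887, "negative": 420},
    "66Agree": {"neutral": 2529, "positive": 1168, "negative": 514},
    "50Agree": {"neutral": 2873, "positive": 1363, "negative": 604},
}


def class_shares(label_counts):
    """Convert absolute label counts into fractions of the file's total."""
    total = sum(label_counts.values())
    return {label: n / total for label, n in label_counts.items()}


for name, label_counts in counts.items():
    shares = class_shares(label_counts)
    row = ", ".join(f"{label} {share:.1%}" for label, share in shares.items())
    print(f"{name} (n={sum(label_counts.values())}): {row}")
```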