## DATA HACKERMAN FINAL PROJECT

### PART 2 - _File type manipulation and formatting_

Three files are provided: one CSV, one TXT and one JSON. Each contains 1000 rows of data. There are two challenges, both of which involve collating these files into one data frame. The fields in all files are:

   - 'author.properties.friends'
   - 'author.properties.status_count'
   - 'author.properties.verified'
   - 'content.body'
   - 'location.country'
   - 'properties.platform'
   - 'properties.sentiment'
   - 'location.latitude'
   - 'location.longitude'

A '.' in a field name indicates a nested field.
 
a) Collate the CSV and TXT files into one pandas dataframe. The dataframe should have 2000 rows and all of the columns present in both files.

b) Using the created dataframe, integrate the data from the JSON file into the existing columns. The resulting dataframe should have 3000 rows.

In [20]:
import pandas as pd
import json
import os

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.options.display.max_colwidth = None
pd.set_option("display.float_format", lambda x: '%.2f' % x)

from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

from data_ingestion.ingest import get_data
from parameters.params import csv_file_path, text_file_path, json_file_path, save_path 
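
The `get_data` helper imported above is not shown in this notebook. As a rough idea of what such a loader could look like, the sketch below picks a reader from the file extension. The name `load_file`, the `sep` parameter and the assumption that the TXT file is delimited text are all hypothetical; the real `get_data` may work differently.

```python
import json
from pathlib import Path

import pandas as pd


def load_file(path, sep=","):
    """Hypothetical stand-in for get_data: choose a reader from the extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".json":
        # Return the parsed JSON unchanged (a list of nested dicts in this project).
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    if suffix in {".csv", ".txt"}:
        # Assumption: the TXT file is delimited text; pass sep="\t" if it is tab-separated.
        return pd.read_csv(path, sep=sep)
    raise ValueError(f"Unsupported file type: {suffix!r}")
```

Returning the raw JSON rather than a dataframe matches how `json_data[0]` is inspected later, before it is flattened.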

### Read in Data

**CSV DATA**

In [3]:
csv_data = get_data(csv_file_path)
csv_data.sample(2)

|  | author.properties.friends | author.properties.status_count | author.properties.verified | content.body | location.country | properties.platform | properties.sentiment | location.latitude | location.longitude |
|---|---|---|---|---|---|---|---|---|---|
| 231 | 27 | 1711 | False | @Jack_Septic_Eye I'm in Ireland, can we meet? ❤️✌🏻️ | GB | twitter | -1 | 55.2 | -6.25 |
| 209 | 144 | 11769 | False | mans even bought new clobber and im still int dumps how poo | GB | twitter | -1 | 53.54 | -2.65 |


In [4]:
csv_data.columns

Index(['author.properties.friends', 'author.properties.status_count',
       'author.properties.verified', 'content.body', 'location.country',
       'properties.platform', 'properties.sentiment', 'location.latitude',
       'location.longitude'],
      dtype='object')

In [5]:
csv_data.shape

(1000, 9)

**TXT DATA**

In [6]:
txt_data = get_data(text_file_path)
txt_data.sample(2)

|  | author.properties.friends | author.properties.verified | location.longitude | author.properties.status_count | properties.sentiment | location.latitude | location.country | content.body | properties.platform |
|---|---|---|---|---|---|---|---|---|---|
| 186 | 794 | False | -0.19 | 12978.0 | 1.0 | 50.84 | GB | Nope :) https://t.co/eWbQNYTQOO | twitter |
| 880 | 12406 | False | -4.62 | 49176.0 | 1.0 | 55.47 | GB | @cuddlememila love you 😘 | twitter |


In [7]:
txt_data.columns

Index(['author.properties.friends', 'author.properties.verified',
       'location.longitude', 'author.properties.status_count',
       'properties.sentiment', 'location.latitude', 'location.country',
       'content.body', 'properties.platform'],
      dtype='object')

In [8]:
txt_data.shape

(1000, 9)

**JSON DATA**

In [9]:
json_data = get_data(json_file_path)
json_data[0]

{'author': {'properties': {'friends': 150,
   'verified': False,
   'status_count': 583}},
 'location': {'longitude': -1.4496120000000003,
  'country': 'GB',
  'latitude': 53.38322877572023},
 'content': {'body': "To everyone tryin to snapchat me fuck off I'm ugly"},
 'properties': {'sentiment': -1, 'platform': 'twitter'}}

In [10]:
json_data_normalized = pd.json_normalize(json_data)
json_data_normalized.sample(2)

|  | author.properties.friends | author.properties.verified | author.properties.status_count | location.longitude | location.country | location.latitude | content.body | properties.sentiment | properties.platform |
|---|---|---|---|---|---|---|---|---|---|
| 860 | 526 | False | 13108 | -3.6 | GB | 55.98 | Worst for putting stuff down then forgetting where I've put it | -1 | twitter |
| 142 | 243 | False | 13893 | -1.22 | GB | 54.68 | revenge is my favourite thing ever | 1 | twitter |


In [11]:
json_data_normalized.shape

(1000, 9)
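
To make the flattening step concrete, here is a minimal sketch of `pd.json_normalize` applied to a single record with the same nesting as `json_data[0]`. The coordinate values are rounded and the body text is a placeholder, so this illustrates the mechanism rather than reproducing the data.

```python
import pandas as pd

# One record shaped like json_data[0]; the body text is a placeholder.
records = [
    {
        "author": {"properties": {"friends": 150, "verified": False, "status_count": 583}},
        "location": {"longitude": -1.45, "country": "GB", "latitude": 53.38},
        "content": {"body": "placeholder text"},
        "properties": {"sentiment": -1, "platform": "twitter"},
    }
]

# Nested keys are joined with "." by default (the sep argument controls the separator).
flat = pd.json_normalize(records)
print(flat.columns.tolist())
print(flat.shape)
```

The dotted column names this produces are the same as the CSV and TXT headers, which is what lets the three sources share one set of columns.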

### Collating the CSV and TXT files together

In [12]:
csv_txt_combined = pd.concat([csv_data, txt_data])

In [13]:
csv_txt_combined.shape

(2000, 9)
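
The CSV and TXT frames list their columns in different orders, so it is worth seeing why `pd.concat` still stacks them correctly. The toy frames below are illustrative stand-ins for `csv_data` and `txt_data`, not the project data.

```python
import pandas as pd

# Toy frames with the same columns in a different order.
left = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
right = pd.DataFrame({"b": ["z"], "a": [3]})

# Check that both frames expose the same column names before stacking them.
same_columns = set(left.columns) == set(right.columns)
print(same_columns)

# concat matches columns by label, so the differing order is harmless;
# the result follows the column order of the first frame.
stacked = pd.concat([left, right])
print(stacked.columns.tolist())
print(stacked.shape)
```

If the column sets differed, `pd.concat` would by default keep the union of columns and fill the gaps with `NaN`, so the set comparison is a cheap guard against silently widening the frame.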

In [14]:
csv_txt_combined[998:1002]

|  | author.properties.friends | author.properties.status_count | author.properties.verified | content.body | location.country | properties.platform | properties.sentiment | location.latitude | location.longitude |
|---|---|---|---|---|---|---|---|---|---|
| 998 | 2445 | 3848.0 | False | Spent half my childhood watching music channels, never felt so nostalgic ahaha | GB | twitter | -1.0 | 51.18 | -0.61 |
| 999 | 253 | 38802.0 | False | Who is responsible for this because same https://t.co/L5q648stp4 | GB | twitter | -1.0 | 52.63 | -1.13 |
| 0 | 632 | 106490.0 | False | @moel_bryn https://t.co/qvz1bI2Utb | GB | twitter | 0.0 | 52.12 | -2.32 |
| 1 | 278 | 31467.0 | False | Who wants to rap battle with me on stream tomorrow 👀 | GB | twitter | -1.0 | 51.6 | -0.34 |
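
The slice above runs 998, 999, 0, 1 in the index column: `pd.concat` keeps each frame's original row labels, so the combined index contains duplicates. The positional slice still works, but label-based lookups become ambiguous. A minimal sketch of the difference, using toy frames rather than the project data:

```python
import pandas as pd

first = pd.DataFrame({"value": [10, 11]})
second = pd.DataFrame({"value": [20, 21]})

# Default behaviour: the row labels are carried over (0, 1, 0, 1).
kept = pd.concat([first, second])

# ignore_index=True discards the old labels and numbers the rows 0..n-1.
fresh = pd.concat([first, second], ignore_index=True)

print(kept.index.is_unique)
print(fresh.index.is_unique)
print(len(kept.loc[0]))  # a label lookup returns every row labelled 0
```

Whether to renumber is a design choice. Because the final CSV is written with `index=False`, the duplicate labels never reach the saved file either way.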


In [15]:
csv_txt_combined.columns

Index(['author.properties.friends', 'author.properties.status_count',
       'author.properties.verified', 'content.body', 'location.country',
       'properties.platform', 'properties.sentiment', 'location.latitude',
       'location.longitude'],
      dtype='object')

### Collating the CSV, TXT and JSON files together

In [16]:
csv_txt_json_combined = pd.concat([csv_txt_combined, json_data_normalized])
csv_txt_json_combined.shape

(3000, 9)

In [18]:
csv_txt_json_combined[1998:2002]

|  | author.properties.friends | author.properties.status_count | author.properties.verified | content.body | location.country | properties.platform | properties.sentiment | location.latitude | location.longitude |
|---|---|---|---|---|---|---|---|---|---|
| 998 | 346 | 18689.0 | False | @Amyyy14 thank u so much Amy you really get me ❤️ I come home tmrw let's get drinks | GB | twitter | 1.0 | 54.79 | -1.56 |
| 999 | 205 | 2271.0 | False | Groves on points https://t.co/qHxgz4Fn8f | GB | twitter | -1.0 | 52.19 | -1.7 |
| 0 | 150 | 583.0 | False | To everyone tryin to snapchat me fuck off I'm ugly | GB | twitter | -1.0 | 53.38 | -1.45 |
| 1 | 1321 | 86271.0 | False | @cammiescott have you ever been to Scotland? You should give Nessie a wee visit! (I live near her) #askcamscott | GB | twitter | -1.0 | 57.79 | -4.2 |
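
In the combined output, `author.properties.status_count` and `properties.sentiment` appear as floats (for example `583.0`), although the JSON source holds integers. This happens when an integer column is concatenated with a float column. One plausible cause, not confirmed by this notebook, is that the TXT file contains missing values, which force a float dtype. The sketch below reproduces the effect on toy data and shows one possible fix using pandas' nullable `Int64` dtype.

```python
import numpy as np
import pandas as pd

ints = pd.DataFrame({"count": [583, 1711]})          # integer column, as in the JSON data
floats = pd.DataFrame({"count": [12978.0, np.nan]})  # float column; the NaN is an assumed cause

combined = pd.concat([ints, floats], ignore_index=True)
dtype_before = combined["count"].dtype
print(dtype_before)

# The nullable integer dtype keeps whole numbers and represents gaps as <NA>.
combined["count"] = combined["count"].astype("Int64")
print(combined["count"].dtype)
print(combined["count"].isna().sum())
```

This conversion only succeeds when every non-missing value is a whole number, so it suits counts and sentiment labels but not latitude or longitude.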


In [21]:
csv_txt_json_combined.to_csv(os.path.join(save_path,'combined_data.csv'), index=False)
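
A quick way to confirm the export is to read the file back and compare it with what was written. The sketch below does this on a toy frame in a temporary directory; the frame contents and the temporary path are illustrative and stand in for `csv_txt_json_combined` and `save_path`.

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame(
    {
        "content.body": ["hello, world", "second row"],
        "properties.sentiment": [1, -1],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "combined_data.csv")
    # index=False leaves the row labels out of the file.
    frame.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(reloaded.shape == frame.shape)
print(reloaded.columns.tolist() == frame.columns.tolist())
```

Text containing commas survives the round trip because `to_csv` quotes such fields when writing.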
