# Sentiment Analysis: Part 1, AWS and EMR
### Collecting data from Amazon Web Services using Elastic MapReduce

## Background

We are working with a government health agency to create a suite of smart phone medical apps for use by aid workers in
developing countries. This suite of apps will enable the aid workers to manage local health conditions by facilitating
communication with medical professionals located elsewhere. The government agency requires that the app suite be 
bundled with one model of smart phone. This will help them to limit purchase costs and ensure uniformity when training
aid workers to use the device. 

## Objective 

We were given a short list of devices that are all capable of executing the app suite's functions, and we were asked 
to examine the prevalence of positive and negative attitudes toward these devices on the web. Our goal is to narrow 
this list down to one device by conducting a broad-based web sentiment analysis to gain insight into the attitudes 
toward the devices.

For the first part of the project, we will use the AWS Elastic MapReduce (EMR) platform to run a series of Hadoop 
Streaming jobs that will collect a large number of smart phone-related web pages from a massive repository of web data 
called the Common Crawl. Once this data has been gathered, we will compile it into data matrix files for analysis. 

## Collect and Prepare the Data 

### General Approach

Our general approach to this project is to count words associated with sentiment toward these devices within relevant 
documents on the web. We then leverage this data and machine learning methods to look for patterns in the documents 
that enable us to label each of these documents with a value that represents the level of positive or negative 
sentiment toward each of these devices. We then analyze and compare the frequency and distribution of the sentiment 
for each of these devices.
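
The counting idea can be illustrated with a minimal sketch. The word lists and the `count_terms` function below are hypothetical and are not taken from the provided Mapper.py; they only show how device mentions and sentiment words might be tallied for a single document.

```python
import re
from collections import Counter

# Illustrative word lists (assumptions, not taken from Mapper.py).
DEVICES = {"iphone", "galaxy", "xperia"}
POSITIVE = {"great", "love", "reliable"}
NEGATIVE = {"slow", "broken", "hate"}


def count_terms(text):
    """Count device mentions and sentiment words in one document."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for token in tokens:
        if token in DEVICES:
            counts[token] += 1
        elif token in POSITIVE:
            counts["positive"] += 1
        elif token in NEGATIVE:
            counts["negative"] += 1
    return counts


doc = "I love my iPhone, but the Galaxy felt slow."
print(dict(count_terms(doc)))
```

Each document then becomes one row of counts, which is the shape of data the later machine learning step works with.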

To gauge the sentiment toward these devices reliably, we must do this on a very large scale. To that end, we 
use the cloud computing platform provided by Amazon Web Services (AWS) to conduct the analysis. The datasets we 
analyze come from Common Crawl, an open repository of web crawl data that is hosted as one of Amazon’s Public Data Sets.

We are provided with the following scripts:

* Mapper Python script (Mapper.py), which examines and counts data from portions of the Common Crawl data.
[Mapper](https://github.com/snowlee26/Portfolio-/blob/master/Mapper.py)
* Reducer Python script (Reducer.py), which accumulates the analysis from the individual mapper jobs.
[Reducer](https://github.com/snowlee26/Portfolio-/blob/master/Reducer.py)
* Aggregation script (concatenatepv3.py), which helps stitch together the raw output from the multiple job flows.
[Concatenate](https://github.com/snowlee26/Portfolio-/blob/master/concatenatepv3.py)

### Identify the data source - Common Crawl

Common Crawl crawls and archives a large portion of the publicly readable web roughly once per month. The archived 
files are stored on Amazon Web Services S3 in the N. Virginia region. Each crawl is split into thousands of roughly 
equal-sized files, which are saved in WARC (Web ARChive) format and gzipped. Each of these files has its own specific 
address, and we use these addresses as input with Amazon Web Services.

Because we are interested in sentiment mining, we focus on the WET files, a variant of the WARC files that contains 
only text. As a first step toward getting our input addresses, we download the wet.paths file for the most recent monthly crawl from the [Common Crawl Blog](http://commoncrawl.org/connect/blog/).

* The wet.paths file consists of tens of thousands of addresses like the ones below, and we saved it as a BDF file. 
![](wet.paths.png)
* We added “s3://commoncrawl/” to the beginning of every file address we intend to use as input, so that EMR will
recognize the addresses. 
![](wet.paths.aws.png)
* We set up three S3 buckets: one for the mapper and reducer scripts, one for the output, and one for debugging logs.
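
The prefixing step can be scripted rather than done by hand. The following is a minimal sketch; the function name and the sample path are hypothetical, and the path only imitates the general shape of a wet.paths entry.

```python
S3_PREFIX = "s3://commoncrawl/"


def to_s3_addresses(paths):
    """Prefix each non-empty path with the Common Crawl bucket address."""
    addresses = []
    for path in paths:
        path = path.strip()
        if not path:
            continue  # ignore blank lines
        if not path.startswith(S3_PREFIX):
            path = S3_PREFIX + path.lstrip("/")
        addresses.append(path)
    return addresses


# Hypothetical entry, shaped like a line in wet.paths but not a real file.
sample = ["crawl-data/EXAMPLE-CRAWL/segments/000/wet/example.warc.wet.gz\n", "\n"]
print(to_s3_addresses(sample))
```

Skipping lines that already carry the prefix makes the function safe to run twice on the same file.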

### Run the EMR job flow using the AWS CLI (Command Line Interface)

Since we are running very large jobs, we use the command line interface for this task. The interface is accessed 
from the 'terminal' in Mac OS X or the 'Command Prompt' on a Windows machine. The AWS Command Line Interface (CLI) 
provides the ability to programmatically launch job flows and monitor their progress.

We are provided with a CreateJson Python file (createJsonFiles.py). The script takes the S3 addresses from the
BDF file as input and generates JSON files that describe the jobs to run. 
[CreateJson](https://github.com/snowlee26/Portfolio-/blob/master/createJsonFilesPv3.py)
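
To give a sense of what such a file can contain, here is a hedged sketch that builds one Hadoop Streaming step per input address, following the step layout the AWS CLI accepts for streaming steps as we understand it. The actual output of createJsonFiles.py may differ, and the bucket names and the `build_streaming_steps` helper are hypothetical.

```python
import json


def build_streaming_steps(input_addresses, mapper_uri, reducer_uri, output_bucket):
    """Build a list of streaming step definitions, one per input address."""
    mapper_name = mapper_uri.rsplit("/", 1)[-1]
    reducer_name = reducer_uri.rsplit("/", 1)[-1]
    steps = []
    for index, address in enumerate(input_addresses):
        steps.append({
            "Name": f"wet-step-{index}",
            "Type": "STREAMING",
            "ActionOnFailure": "CONTINUE",
            "Args": [
                "-files", f"{mapper_uri},{reducer_uri}",
                "-mapper", mapper_name,
                "-reducer", reducer_name,
                "-input", address,
                # Each step writes to its own output folder.
                "-output", f"{output_bucket}/output-{index}",
            ],
        })
    return steps


# All bucket names and paths below are placeholders.
steps = build_streaming_steps(
    ["s3://commoncrawl/crawl-data/EXAMPLE/example.warc.wet.gz"],
    "s3://my-scripts-bucket/Mapper.py",
    "s3://my-scripts-bucket/Reducer.py",
    "s3://my-output-bucket",
)
print(json.dumps(steps, indent=2))
```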

Steps to run the job:

* Select WET file addresses. 
* Copy the CreateJsonFiles.py Python script into the folder with the .bdf file and personalize the script.
  * Update the Python script to have the correct S3 locations for your Mapper.py and Reducer.py files.
  * Update the Python script to have the correct S3 address for your output bucket.
* Run the CreateJsonFiles.py Python file from the command shell to generate your .json file.
  * Code:
![](code.png)
  * Output of the code:
![](json.png)
* Check the validity of your .json file. For the CLI to process it correctly, the JSON file must be well formed. 
Here, we used [JSONLint](https://jsonlint.com/) to validate our file. After a small adjustment to the formatting, 
we were ready for the next step. 
* Run the .json files from the CLI to create an EMR cluster.
  * Code:
    ![](createEMRcluster.png)
* Monitor your cluster from the AWS Console.
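
The validity check in the steps above can also be done locally with Python's standard `json` module instead of an online validator. This is a minimal sketch; `validate_json_text` is a hypothetical helper, and it only confirms that the text parses as JSON, not that EMR will accept its contents.

```python
import json


def validate_json_text(text):
    """Return (True, None) if text parses as JSON, else (False, message)."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return False, f"line {err.lineno}, column {err.colno}: {err.msg}"
    return True, None


good = '[{"Name": "step", "Type": "STREAMING"}]'
bad = '[{"Name": "step", "Type": "STREAMING"},]'  # trailing comma is invalid

print(validate_json_text(good))
print(validate_json_text(bad))
```

The reported line and column point at where parsing failed, which helps locate small formatting problems such as a stray comma.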

### Consolidate the results of the jobs

We need to aggregate the results of the streaming jobs we set up. This involves the following steps:
* Download the EMR output from the S3 output bucket. We use Cyberduck to download all of the individual output folders 
to a single folder located on the local machine. 
* Put the concatenatepv3.py file in the same folder where we saved all the EMR output folders. The script opens
each of the EMR output folders and aggregates all of the part files into two .csv files.
* Open a command prompt and run the concatenatepv3.py script.
![](concatenatepv3.png)
* Rename 'concatenated_factors.csv' to LargeMatrix.csv for the machine learning process in the next step. 
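
The aggregation idea can be sketched as follows. This is not the provided concatenatepv3.py, which produces two .csv files; the simplified version below writes a single file and assumes each output folder holds files whose names start with `part-`. The folder names and the `concatenate_parts` helper are hypothetical.

```python
import tempfile
from pathlib import Path


def concatenate_parts(root, destination):
    """Append every part-* file in root's subfolders to destination.

    Returns the number of lines written.
    """
    lines_written = 0
    with open(destination, "w", encoding="utf-8") as out:
        for part in sorted(Path(root).glob("*/part-*")):
            with open(part, encoding="utf-8") as handle:
                for line in handle:
                    # Make sure a file without a final newline does not
                    # run into the first row of the next file.
                    out.write(line if line.endswith("\n") else line + "\n")
                    lines_written += 1
    return lines_written


# Demonstration with temporary folders standing in for downloaded EMR output.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for i, rows in enumerate(["1,0,2\n", "3,1,0\n4,0,0"]):
        folder = root / f"output-{i}"
        folder.mkdir()
        (folder / "part-00000").write_text(rows, encoding="utf-8")
    total = concatenate_parts(root, root / "combined.csv")
    combined = (root / "combined.csv").read_text(encoding="utf-8")

print(total)
print(combined)
```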


## Final Dataset
**Here, we show the first few rows of the final dataset we collected from AWS.**

    import pandas as pd
    LargeMatrix = pd.read_csv('100wet.csv')
    LargeMatrix.head()

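| Unnamed: 0 | id | iphone | samsunggalaxy | sonyxperia | nokialumina | htcphone | ios | googleandroid | iphonecampos | samsungcampos | ... | samsungperunc | sonyperunc | nokiaperunc | htcperunc | iosperpos | googleperpos | iosperneg | googleperneg | iosperunc | googleperunc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2 | 0 | 0 | 0 | 0 | 2 | 6 | 2 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |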


**This completes the first part of the project. We collected and assembled a dataset that captures the level of 
positive and negative sentiment toward each of these devices. We will use this final dataset for analysis in the next step.**