# TEAM NAME: Classification Predict
## Introduction
### Context
Many companies are built around lessening one’s environmental impact or carbon footprint. They offer products and services that are environmentally friendly and sustainable, in line with their values and ideals. They would like to determine how people perceive climate change and whether or not they believe it is a real threat. This would add to their market research efforts in gauging how their product/service may be received.

### The challenge

With this context, EDSA is challenging you during the Classification Sprint with the task of creating a Machine Learning model that is able to classify whether or not a person believes in climate change, based on their novel tweet data.

Providing an accurate and robust solution to this task gives companies access to a broad base of consumer sentiment, spanning multiple demographic and geographic categories - thus increasing their insights and informing future marketing strategies.

### Hypothesis
* Provide an interpretation of the target variable
* List out the features on which our target variable might depend
* Give a view about the problem based on domain knowledge

### Data and Library Imports
Now we will import the libraries required to perform:
* language manipulation
* data import, manipulation and visualisation

In [3]:
# Library imports

# language manipulation
import nltk # toolkit for language processing
from nltk.corpus import stopwords # redundant words

# Data manipulation and visualisation
import numpy as np # mathematical processing
import pandas as pd # data manipulation
import seaborn as sns # data visualisation
import matplotlib.pyplot as plt  # data visualisation
%matplotlib inline

import re # regular expressions

# set plot style
sns.set()


Next we will load the `train` and `test` data provided. Thereafter we will inspect the first 5 rows of each dataset to get an understanding of the data.

In [5]:
# Data importation
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [8]:
# first 5 rows of train data
train.head()

| | sentiment | message | tweetid |
|---|---|---|---|
| 0 | 1 | PolySciMajor EPA chief doesn't think carbon di... | 625221 |
| 1 | 1 | It's not like we lack evidence of anthropogeni... | 126103 |
| 2 | 2 | RT @RawStory: Researchers say we have three ye... | 698562 |
| 3 | 1 | #TodayinMaker# WIRED : 2016 was a pivotal year... | 573736 |
| 4 | 1 | RT @SoyNovioDeTodas: It's 2016, and a racist, ... | 466954 |
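
The messages previewed above carry Twitter-specific noise such as retweet prefixes, mentions and hashtags. As a minimal sketch of how the `re` module imported earlier could be used to normalise such text, consider the following. The function name `clean_tweet`, the chosen cleaning rules and the example string are illustrative assumptions, not steps taken from this notebook.

```python
import re

def clean_tweet(text):
    """Lowercase a tweet and strip retweet markers, mentions, URLs and '#' signs."""
    text = re.sub(r'^RT\s+', '', text)         # drop a leading retweet marker
    text = re.sub(r'https?://\S+', ' ', text)  # drop URLs
    text = re.sub(r'@\w+:?', ' ', text)        # drop @mentions and a trailing colon
    text = text.replace('#', ' ')              # keep the hashtag word, drop the '#'
    text = re.sub(r'\s+', ' ', text)           # collapse repeated whitespace
    return text.strip().lower()

# Hypothetical tweet, made up for illustration only
example = "RT @SomeUser: Climate change is real #ActNow https://example.com/x"
cleaned = clean_tweet(example)
print(cleaned)
```

Whether hashtags and mentions should be removed or kept as features is a modelling choice; in practice a function like this could be applied with `train['message'].apply(clean_tweet)`.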


In [9]:
# first 5 rows of test data
test.head()

| | message | tweetid |
|---|---|---|
| 0 | Europe will now be looking to China to make su... | 169760 |
| 1 | Combine this with the polling of staffers re c... | 35326 |
| 2 | The scary, unimpeachable evidence that climate... | 224985 |
| 3 | @Karoli @morgfair @OsborneInk @dailykos \nPuti... | 476263 |
| 4 | RT @FakeWillMoore: 'Female orgasms cause globa... | 872928 |
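
The two previews suggest that `test` has the same columns as `train` except for the target. A small sketch of how that could be checked programmatically, along with missing values and overlapping tweet ids, is given below. The frames `train_demo` and `test_demo` are made-up stand-ins with the same named columns as the previews, used so that the example runs without the CSV files.

```python
import pandas as pd

# Toy frames mirroring the named columns shown above; the rows are invented
train_demo = pd.DataFrame({
    'sentiment': [1, 2],
    'message': ['first made-up tweet', 'second made-up tweet'],
    'tweetid': [101, 102],
})
test_demo = pd.DataFrame({
    'message': ['third made-up tweet'],
    'tweetid': [103],
})

# Columns present in train but absent from test (expected: the target only)
target_cols = [c for c in train_demo.columns if c not in test_demo.columns]

# Number of missing values per column in the training data
missing_per_column = train_demo.isnull().sum()

# Tweet ids that appear in both sets (ideally none)
overlap = set(train_demo['tweetid']) & set(test_demo['tweetid'])

print(target_cols)
print(train_demo.shape, test_demo.shape)
```

Replacing `train_demo` and `test_demo` with the `train` and `test` frames loaded earlier would run the same checks on the real data.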


#### Data Description
The collection of this data was funded by a Canada Foundation for Innovation JELF Grant to Chris Bauch, University of Waterloo. The dataset aggregates tweets pertaining to climate change collected between Apr 27, 2015 and Feb 21, 2018. In total, 43943 tweets were collected. Each tweet is labelled as one of the following classes:

Class Description
* 2 News: the tweet links to factual news about climate change
* 1 Pro: the tweet supports the belief of man-made climate change
* 0 Neutral: the tweet neither supports nor refutes the belief of man-made climate change
* -1 Anti: the tweet does not believe in man-made climate change

Variable definitions
- sentiment: Sentiment of tweet
- message: Tweet body
- tweetid: Twitter unique id

Let's confirm that the sentiment values conform to the description above.

In [None]:
# identify the sentiment values
train['sentiment'].unique()

array([ 1,  2,  0, -1], dtype=int64)

### Exploratory Data Analysis
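
Beyond listing the unique values, it can help to check them against the class description and to see how many tweets fall into each class. The sketch below shows one possible way to do this. The `LABELS` dictionary is taken from the class description above, while the `demo` frame is a made-up stand-in for `train`, so the counts it prints say nothing about the real data.

```python
import pandas as pd

# Class codes and names as given in the data description
LABELS = {2: 'News', 1: 'Pro', 0: 'Neutral', -1: 'Anti'}

# Made-up sentiment column standing in for train['sentiment']
demo = pd.DataFrame({'sentiment': [1, 1, 2, 0, -1, 1]})

# Any code that is not part of the documented classes would show up here
unexpected = set(demo['sentiment'].unique()) - set(LABELS)

# Absolute and relative class frequencies, using readable class names
class_counts = demo['sentiment'].map(LABELS).value_counts()
class_share = demo['sentiment'].map(LABELS).value_counts(normalize=True)

print(unexpected)
print(class_counts)
```

Running the same lines on `train` would show whether the classes are balanced, which matters when choosing an evaluation metric for the classifier.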

##### Hypothesis validation