![title](bw.JPG)

# Problem Statement

Societe Generale (SocGen) is a French multinational banking and financial services company. With over 154,000 employees based in 76 countries, it serves over 32 million clients worldwide every day.

It provides services such as retail banking, corporate and investment banking, asset management, portfolio management, insurance, and other financial services.

When handling customer complaints, it is hard to track the status of each complaint. To automate this process, SocGen wants you to build a model that predicts the complaint status (how the complaint was resolved) from the complaint text submitted by the consumer and the related metadata.

## Data Description
The dataset consists of three files: train.csv, test.csv and sample_submission.csv.

|Column|Description|
|------|------|
|Complaint-ID|Complaint ID|
|Date-received|Date on which the complaint was received|
|Transaction-Type|Type of transaction involved|
|Complaint-reason|Reason for the complaint|
|Consumer-complaint-summary|Complaint text filed by the consumer; present in three languages: English, Spanish, and French|
|Company-response|Public response provided by the company (if any)|
|Date-sent-to-company|Date on which the complaint was sent to the respective department|
|Complaint-Status|Status of the complaint (Target Variable)|
|Consumer-disputes|If the consumer raised any disputes|


### Submission Format
Please submit the prediction as a .csv file in the format described below and in the sample submission file.

|Complaint-ID|Complaint-Status|
|------|------|
|Te-1|Closed with explanation|
|Te-2|Closed with explanation|
|Te-3|Closed with explanation|
|Te-4|Closed with non-monetary relief|
|Te-5|Closed with explanation|
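
The format above can be produced with a few lines of Python. This is a minimal sketch using only the standard library; the `predictions` mapping and the `write_submission` helper are hypothetical names introduced for illustration, and real predictions would come from a trained model.

```python
import csv
import io

# Hypothetical predictions keyed by Complaint-ID (illustrative values only).
predictions = {
    "Te-1": "Closed with explanation",
    "Te-2": "Closed with non-monetary relief",
}

def write_submission(preds, handle):
    """Write a header row plus one Complaint-ID, Complaint-Status row per prediction."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["Complaint-ID", "Complaint-Status"])
    for complaint_id, status in preds.items():
        writer.writerow([complaint_id, status])

# Write to an in-memory buffer here; pass an open file handle to write a real .csv.
buffer = io.StringIO()
write_submission(predictions, buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

With predictions already in a pandas DataFrame that has these two columns, `DataFrame.to_csv('submission.csv', index=False)` would give the same layout.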

### Evaluation
**Submissions are evaluated on the F1 score with ‘weighted’ averaging.**
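
To make the metric concrete: a weighted F1 score computes the F1 score for each class separately and then averages them, weighting each class by its share of the true labels. The sketch below is a plain-Python illustration of that definition, not the competition's scoring code; `weighted_f1` is a hypothetical helper name and the labels are toy data.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to each class's true count."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # True positives: predicted this label and it was correct.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n_true / total
    return score

# Toy example: three complaints of one class, one of another.
y_true = ["Closed with explanation"] * 3 + ["Closed with non-monetary relief"]
y_pred = ["Closed with explanation"] * 2 + ["Closed with non-monetary relief"] * 2
print(round(weighted_f1(y_true, y_pred), 4))  # 0.7667
```

In practice, `sklearn.metrics.f1_score(y_true, y_pred, average='weighted')` computes the same quantity. Because the weights follow class frequency, the majority class dominates the score.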

# Data Preprocessing

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input/c3cc8568-0-dataset"))

# Any results you write to the current directory are saved as output.

In [None]:
train1=pd.read_csv('../input/c3cc8568-0-dataset/train.csv')
test1=pd.read_csv('../input/c3cc8568-0-dataset/test.csv')

In [None]:
train1.isnull().sum()

## Cleaning

In [None]:
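import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

# Build the stop-word set and the stemmer once instead of on every call.
# Requires the NLTK stopwords corpus, e.g. nltk.download('stopwords').
STOPS = set(stopwords.words("english")+stopwords.words("french")+stopwords.words("spanish"))
# Note: only an English stemmer is applied, even to French and Spanish text.
STEMMER = SnowballStemmer("english", ignore_stopwords=True)

def clean_text(raw_text):
    raw_text = raw_text.strip()
    # In Python 3, str has no .decode(); only bytes input needs decoding.
    if isinstance(raw_text, bytes):
        raw_text = raw_text.decode("utf-8-sig", errors="replace").replace(u"\ufffd", "?")
    # Keep ASCII letters only (this also drops accented characters), then lowercase and tokenize.
    letters_only = re.sub("[^a-zA-Z]", " ", raw_text)
    words = letters_only.lower().split()
    # Drop stop words from all three languages, then stem what is left.
    meaningful_words = [w for w in words if w not in STOPS]
    return " ".join(STEMMER.stem(word) for word in meaningful_words)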

In [None]:
def dateSim(val):
    # 1 if the complaint was sent to the company on the day it was received, else 0
    return 1 if val==0 else 0

train=train1.copy()
train['Date-received']=pd.to_datetime(train['Date-received'])
train['Date-sent-to-company']=pd.to_datetime(train['Date-sent-to-company'])
train['diff'] = train['Date-sent-to-company'] - train['Date-received']
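train['diff_days']=train['diff']/np.timedelta64(1,'D')
# 'Y' and 'M' are not fixed-length units, and recent pandas/NumPy versions refuse to divide
# by them, so derive years and months from days using mean Gregorian lengths.
train['diff_year']=train['diff_days']/365.2425
train['diff_m']=train['diff_days']/30.436875
# train['diff_w']=train['diff']/np.timedelta64(1,'W')
# Assign the result back; fillna(inplace=True) on a selected column may not update the frame.
train['Company-response']=train['Company-response'].fillna('None')
train['Consumer-disputes']=train['Consumer-disputes'].fillna('Other')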
train['Consumer-complaint-summary']=train['Consumer-complaint-summary'].apply(clean_text)
train['Complaint-reason']=train['Complaint-reason'].apply(clean_text)
train['isSameDay']=train['diff_days'].apply(dateSim)


train['Complaint-reasonLen']=train['Complaint-reason'].apply(len)
train['Consumer-complaint-summaryLen']=train['Consumer-complaint-summary'].apply(len)

train.drop(['Date-sent-to-company','Date-received','diff'],axis=1,inplace=True)

In [None]:
train.to_csv('trainV1.csv',index=False)
# Keep head() as the last expression so the notebook displays it
train.head()

In [None]:
import gc
gc.collect()

In [None]:
test=test1.copy()
test['Date-received']=pd.to_datetime(test['Date-received'])
test['Date-sent-to-company']=pd.to_datetime(test['Date-sent-to-company'])
test['diff'] = test['Date-sent-to-company'] - test['Date-received']
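test['diff_days']=test['diff']/np.timedelta64(1,'D')
# 'Y' and 'M' are not fixed-length units, and recent pandas/NumPy versions refuse to divide
# by them, so derive years and months from days using mean Gregorian lengths.
test['diff_year']=test['diff_days']/365.2425
test['diff_m']=test['diff_days']/30.436875
# test['diff_w']=test['diff']/np.timedelta64(1,'W')
# Assign the result back; fillna(inplace=True) on a selected column may not update the frame.
test['Company-response']=test['Company-response'].fillna('None')
test['Consumer-disputes']=test['Consumer-disputes'].fillna('Other')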
test['Consumer-complaint-summary']=test['Consumer-complaint-summary'].apply(clean_text)
test['Complaint-reason']=test['Complaint-reason'].apply(clean_text)
test['isSameDay']=test['diff_days'].apply(dateSim)

test['Complaint-reasonLen']=test['Complaint-reason'].apply(len)
test['Consumer-complaint-summaryLen']=test['Consumer-complaint-summary'].apply(len)

test.drop(['Date-sent-to-company','Date-received','diff'],axis=1,inplace=True)
test.head()

In [None]:
test.to_csv('testV1.csv',index=False)