![title](bw.JPG)

# Problem Statement

Societe Generale (SocGen) is a French multinational banking and financial services company. With over 154,000 employees based in 76 countries, it serves over 32 million clients worldwide every day.

They provide services like retail banking, corporate and investment banking, asset management, portfolio management, insurance and other financial services.

Tracking the status of customer complaints is difficult. To automate this process, SocGen wants you to build a model that predicts the complaint status (how the complaint was resolved) from the complaint text submitted by the consumer and other related metadata.

## Data Description
The dataset consists of three files: train.csv, test.csv and sample_submission.csv.

|Column|Description|
|------|------|
|Complaint-ID|Complaint Id|
|Date-received|Date on which the complaint was received|
|Transaction-Type|Type of transaction involved|
|Complaint-reason|Reason for the complaint|
|Consumer-complaint-summary|Complaint filed by the consumer; present in three languages: English, Spanish, and French|
|Company-response|Public response provided by the company (if any)|
|Date-sent-to-company|Date on which the complaint was sent to the respective department|
|Complaint-Status|Status of the complaint (Target Variable)|
|Consumer-disputes|If the consumer raised any disputes|


### Submission Format
Please submit the prediction as a .csv file in the format described below and in the sample submission file.

|Complaint-ID|Complaint-Status|
|------|------|
|Te-1|Closed with explanation|
|Te-2|Closed with explanation|
|Te-3|Closed with explanation|
|Te-4|Closed with non-monetary relief|
|Te-5|Closed with explanation|
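
A minimal sketch of producing a file in this format, assuming pandas. The IDs and labels are copied from the sample table above, and `example_preds` is a hypothetical stand-in for real model output. The CSV is written to an in-memory buffer so the snippet has no side effects.

```python
import io

import pandas as pd

# Hypothetical predictions; a real run would use the model's output instead.
example_ids = ["Te-1", "Te-2", "Te-3", "Te-4"]
example_preds = [
    "Closed with explanation",
    "Closed with explanation",
    "Closed with explanation",
    "Closed with non-monetary relief",
]

submission = pd.DataFrame(
    {"Complaint-ID": example_ids, "Complaint-Status": example_preds}
)

# index=False keeps the row index out of the file, so only the two
# required columns are written.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Replacing `buffer` with a file name writes the same content to disk.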

### Evaluation
**Submissions will be evaluated on the F1 score with 'weighted' averaging.**
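
To make the metric concrete, here is a small pure-Python sketch of a weighted F1: the F1 of each class is computed separately and then averaged, with each class weighted by its number of true instances. The helper name `weighted_f1` and the toy labels are illustrative assumptions; in practice `sklearn.metrics.f1_score(y_true, y_pred, average='weighted')` should give the same value.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's true count."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        total += n_true * f1
    return total / len(y_true)


# Toy labels (shortened class names, for illustration only).
toy_true = ["explanation", "explanation", "explanation", "relief"]
toy_pred = ["explanation", "explanation", "relief", "relief"]

# "explanation": precision 1, recall 2/3, F1 = 0.8 (3 true instances)
# "relief":      precision 1/2, recall 1, F1 = 2/3 (1 true instance)
toy_score = weighted_f1(toy_true, toy_pred)
print(round(toy_score, 4))
```

Because the weights follow class frequency, the majority class dominates the score, which is worth keeping in mind when the target is imbalanced.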

# Prediction and Evaluation

In [4]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os

# print(os.listdir("../input/brainwavesml/c3cc8568-0-dataset"))

# Any results you write to the current directory are saved as output.

In [5]:
train1old=pd.read_csv('train.csv')
test1old=pd.read_csv('test.csv')

In [None]:
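# NOTE: the cells below use train1/test1, which appear to be preprocessed versions
# of the raw data (with derived columns such as diff_days and isSameDay).
# Uncomment these lines to load them; the paths point to a Kaggle input dataset.
# train1=pd.read_csv('../input/fork-of-brainwaves-best-d-ata/trainV1.csv')
# test1=pd.read_csv('../input/fork-of-brainwaves-best-d-ata/testV1.csv')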

In [None]:

test1old['Date-sent-to-company']=pd.to_datetime(test1old['Date-sent-to-company'])
test1['day']=test1old['Date-sent-to-company'].dt.day
test1['year']=test1old['Date-sent-to-company'].dt.year
test1['month']=test1old['Date-sent-to-company'].dt.month
test1.head()

In [None]:

train1old['Date-sent-to-company']=pd.to_datetime(train1old['Date-sent-to-company'])
train1['day']=train1old['Date-sent-to-company'].dt.day
train1['year']=train1old['Date-sent-to-company'].dt.year
train1['month']=train1old['Date-sent-to-company'].dt.month
train1.head()

In [None]:
wt=dict(1-train1['Complaint-Status'].value_counts()/train1.shape[0])
wt
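
The cell above gives each class a weight of one minus its relative frequency, so rarer classes get larger weights. A toy sketch of the same formula, assuming pandas; the series `toy_status` is made up for illustration:

```python
import pandas as pd

# Made-up target column: three majority-class rows, one minority-class row.
toy_status = pd.Series(
    ["Closed with explanation"] * 3 + ["Closed with non-monetary relief"],
    name="Complaint-Status",
)

# Same formula as above: 1 - (class count / number of rows).
toy_wt = dict(1 - toy_status.value_counts() / toy_status.shape[0])
print(toy_wt)
```

Here the majority class gets weight 0.25 and the minority class 0.75. Note that the final model in this notebook uses `class_weight='balanced'` rather than this dictionary.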

In [None]:
train1.isnull().sum()

**Tried translating the non-English complaints to English, but the Google Translate API appears to enforce a request limit.**

In [9]:
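from googletrans import Translator

def clean_translate(raw_text):
    """Translate non-English text to English; return the input unchanged on any failure."""
    translator = Translator()
    try:
        # Only call the translation endpoint when the detected language is not English.
        if translator.detect(raw_text).lang != 'en':
            trans = translator.translate(raw_text).text
        else:
            trans = raw_text
    except Exception:
        # Network errors, request limits, or non-string input (e.g. NaN): keep the original text.
        trans = raw_text
    return trans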
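
One possible way to soften the request limit mentioned above is to cache results and retry with an increasing delay. The sketch below is not what this notebook does; `translate_with_retry`, `translate_fn`, `cache`, and `sleep` are hypothetical names, and `translate_fn` stands for any callable that maps a string to its translation.

```python
import time


def translate_with_retry(text, translate_fn, cache, max_retries=3,
                         base_delay=1.0, sleep=time.sleep):
    """Translate `text`, reusing cached results and backing off on errors."""
    if text in cache:
        return cache[text]
    for attempt in range(max_retries):
        try:
            result = translate_fn(text)
            cache[text] = result
            return result
        except Exception:
            # Wait 1x, 2x, 4x ... base_delay between attempts, but not
            # after the final one.
            if attempt < max_retries - 1:
                sleep(base_delay * 2 ** attempt)
    # Same fallback as clean_translate: keep the original text.
    return text


# Demo with a stand-in translator (no network access needed).
demo_cache = {}
demo_out = translate_with_retry("hola", lambda s: s.upper(), demo_cache)
print(demo_out, demo_cache)
```

Since identical complaint texts hit the cache, repeated rows cost only one request each.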



In [10]:
con_com_sum=train1old['Consumer-complaint-summary'].apply(clean_translate)

In [13]:
consumer_compl=pd.DataFrame(data=con_com_sum,index=train1old.index)
consumer_compl.head()

| |Consumer-complaint-summary|
|------|------|
|0|Seterus, Inc. filed a false report with the ma...|
|1|XX / XX / XXXX Bankruptcy Claim XXXX of Chapte...|
|2|XXXX / XXXX / 15, I was preparing the flight b...|
|3|The loan was paid in XXXX XXXX. In XXXX, 4 yea...|
|4|I got a care credit account for XXXX. Immediat...|


In [14]:
consumer_compl.to_csv('consumer_compl.csv',index=False)

In [15]:
con_com_sumtest=test1old['Consumer-complaint-summary'].apply(clean_translate)


| |Consumer-complaint-summary|
|------|------|
|0|XXXX / XXXX / 16 I called Citibank to open a c...|
|1|I'm struggling financially. I called and I off...|
|2|In XXXX of 2015, an automatic payment was conf...|
|3|I submitted a request to XXXX, which is my cur...|
|4|A state tax lien was filed against me XXXX / X...|


In [None]:
consumer_compltest=pd.DataFrame(data=con_com_sumtest,index=test1old.index)
consumer_compltest.head()

In [None]:
consumer_compltest.to_csv('consumer_compltest.csv',index=False)

In [None]:
# import py-translate
# translator = Translator()
# from nltk.misc import babelfish
# smpl=train1['Consumer-complaint-summary'].sample(1,random_state=1994).values
# print(smpl)

# [w for w in smpl if not w in set(stopwords.words("french")) ]
# babelfish.translate(smpl)
# print(translator.translate(smpl))
# train1['Consumer-complaint-summary'].sample(1,random_state=1994).apply(clean_text)

In [None]:
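# Assign the result back instead of using inplace=True on a column selection,
# which is deprecated in recent pandas versions and may not modify train1.
train1['Complaint-reason']=train1['Complaint-reason'].fillna('Other')
train1['Consumer-complaint-summary']=train1['Consumer-complaint-summary'].fillna('Other')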

In [None]:
train=train1.copy()


In [None]:
import gc
gc.collect()
train.head()

In [None]:
train.describe(include='all').T

In [None]:
import seaborn as sns
%matplotlib inline
# train['Consumer-complaint-summaryLen'].plot.bar()
train.columns
# feat=[ 'diff_days', 'diff_year', 'diff_m',
#        'isSameDay', 'Complaint-reasonLen', 'Consumer-complaint-summaryLen']

In [None]:
# train['combine']=train['Complaint-reason']+train['Consumer-complaint-summary']

In [None]:
# from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer

# # vec_cr = TfidfVectorizer(ngram_range=(1,2),stop_words="english", analyzer='word')
# # comp_reason =vec_cr.fit_transform(train['Complaint-reason'])
# # vec_cs = TfidfVectorizer(ngram_range=(1,3),stop_words="english", analyzer='word')
# # consum_comp_sum =vec_cs.fit_transform(train['Consumer-complaint-summary'])

# vec_cs = TfidfVectorizer(ngram_range=(1,3),stop_words="english", analyzer='word')
# consum_comp_sumtot =vec_cs.fit_transform(train['combine'])

# vec_cs = TfidfVectorizer(ngram_range=(1,10),stop_words="english", analyzer='char')
# consum_comp_sumtotchar =vec_cs.fit_transform(train['combine'])

# from scipy.sparse import csr_matrix
# from scipy import sparse
# final_features = sparse.hstack((consum_comp_sumtot,consum_comp_sumtotchar)).tocsr()

# print(1)
# from sklearn.model_selection import train_test_split
# from sklearn.metrics import accuracy_score,f1_score
# X=final_features
# y=train['Complaint-Status']
# X_train,X_val,y_train,y_val = train_test_split(X,y,test_size=0.3,random_state = 1994)
# print(1)
# from sklearn.naive_bayes import MultinomialNB
# from sklearn.linear_model import LogisticRegression
# from sklearn.ensemble import RandomForestClassifier
# from catboost import CatBoostClassifier
# from xgboost import XGBClassifier


    
    
# lr=LogisticRegression(verbose=10,class_weight='balanced',C=5,random_state=1994,n_jobs=-1)
# lr.fit(X_train,y_train)
# print(1)
# lrpred=lr.predict(X_val)
# print(f1_score(y_val,lrpred,average='weighted'))



In [None]:
# def baseline(model,xtrain,ytrain,xval,yval):
#     model.fit(xtrain,ytrain)
#     print('fitted')
#     print(f1_score(yval,model.predict(xval),average='weighted'))

# rf=RandomForestClassifier()  #0.7037876668241548
# xgb=XGBClassifier()
# baseline(xgb,X_train.tocsc(),y_train,X_val.tocsc(),y_val)

In [None]:
train=pd.get_dummies(train,columns=['Transaction-Type','Company-response','Consumer-disputes'],drop_first=True)
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
vec_cr = TfidfVectorizer(ngram_range=(1,2),stop_words="english", analyzer='word')
comp_reason =vec_cr.fit_transform(train['Complaint-reason'])

# stop_words only applies when analyzer='word', so it is omitted for character n-grams.
vec_cr_char = TfidfVectorizer(ngram_range=(1,8), analyzer='char')
comp_reasonChar =vec_cr_char.fit_transform(train['Complaint-reason'])

# vec_cr_charwb = TfidfVectorizer(ngram_range=(1,8),stop_words="english", analyzer='char_wb')
# comp_reasonCharwb =vec_cr_charwb.fit_transform(train['Complaint-reason'])

vec_cs = TfidfVectorizer(ngram_range=(1,3),stop_words="english", analyzer='word')
consum_comp_sum =vec_cs.fit_transform(train['Consumer-complaint-summary'])

# stop_words only applies when analyzer='word', so it is omitted for character n-grams.
vec_csChar = TfidfVectorizer(ngram_range=(1,9), analyzer='char')
consum_comp_sumChar =vec_csChar.fit_transform(train['Consumer-complaint-summary'])

# vec_csCharwb = TfidfVectorizer(ngram_range=(1,9),stop_words="english", analyzer='char_wb')
# consum_comp_sumCharwb =vec_csCharwb.fit_transform(train['Consumer-complaint-summary'])

In [None]:
feats=[ 'diff_days', 'diff_year', 'diff_m','Complaint-reasonLen','Consumer-complaint-summaryLen','day','year','month',
       'Transaction-Type_Checking or savings account',
       'Transaction-Type_Consumer Loan', 'Transaction-Type_Credit card',
       'Transaction-Type_Credit card or prepaid card',
       'Transaction-Type_Credit reporting',
       'Transaction-Type_Credit reporting, credit repair services, or other personal consumer reports',
       'Transaction-Type_Debt collection',
       'Transaction-Type_Money transfer, virtual currency, or money service',
       'Transaction-Type_Money transfers', 'Transaction-Type_Mortgage',
       'Transaction-Type_Other financial service',
       'Transaction-Type_Payday loan',
       'Transaction-Type_Payday loan, title loan, or personal loan',
       'Transaction-Type_Prepaid card', 'Transaction-Type_Student loan',
       'Transaction-Type_Vehicle loan or lease',
       'Transaction-Type_Virtual currency',
       'Company-response_Company believes complaint is the result of an isolated error',
       'Company-response_Company believes complaint relates to a discontinued policy or procedure',
       'Company-response_Company believes complaint represents an opportunity for improvement to better serve consumers',
       'Company-response_Company believes it acted appropriately as authorized by contract or law',
       'Company-response_Company believes the complaint is the result of a misunderstanding',
       "Company-response_Company can't verify or dispute the facts in the complaint",
       'Company-response_Company chooses not to provide a public response',
       'Company-response_Company disputes the facts presented in the complaint',
       'Company-response_Company has responded to the consumer and the CFPB and chooses not to provide a public response',
       'Company-response_None', 'Consumer-disputes_Other',
       'Consumer-disputes_Yes','isSameDay']

In [None]:
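from scipy import sparse
# Cast the dense feature columns to float first: a mix of bool dummy columns and
# numeric columns could otherwise convert to an object array that hstack cannot handle.
dense_feats = sparse.csr_matrix(train[feats].astype(float).values)
final_features = sparse.hstack((dense_feats, comp_reason, consum_comp_sum,comp_reasonChar,consum_comp_sumChar)).tocsr()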
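
The cell above places the dense feature columns side by side with the TF-IDF matrices. A toy sketch of the same idea, assuming NumPy, SciPy, and scikit-learn; the documents and the single length feature are made up for illustration:

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ["late fee charged", "wrong fee", "account closed"]

# One dense numeric feature per row: the character length of the text.
toy_len = np.array([[len(d)] for d in toy_docs], dtype=float)

toy_vec = TfidfVectorizer(analyzer="word")
toy_tfidf = toy_vec.fit_transform(toy_docs)  # sparse, one column per term

# hstack needs every block to have the same number of rows; the result
# keeps the dense column first, followed by the TF-IDF columns.
toy_X = sparse.hstack((sparse.csr_matrix(toy_len), toy_tfidf)).tocsr()
print(toy_X.shape)
```

The column order of the stacked matrix matters: the test matrix has to be assembled from the same blocks in the same order, using the vectorizers fitted on the training data.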

In [None]:
final_features

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,f1_score
X=final_features
y=train['Complaint-Status']
# X_train,X_val,y_train,y_val = train_test_split(X,y,test_size=0.3,random_state = 1994)

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
# lr=LogisticRegression(verbose=10,class_weight='balanced',C=5,random_state=1994,n_jobs=-1)
# lr.fit(X_train,y_train)
# lrpred=lr.predict(X_val)
# print(f1_score(y_val,lrpred,average='weighted'))
import gc
gc.collect()

In [None]:
# import xgboost as xgb
# clf = xgb.XGBClassifier(
# #                 max_depth = 5,
#                 n_estimators=1000,
# #                 learning_rate=0.1, 
# #                 nthread=4,
# #                 subsample=1.0,
# #                 colsample_bytree=0.5,
# #                 min_child_weight = 3,
# #                 scale_pos_weight = ratio,
# #                 reg_alpha=0.03,
#                 seed=1994,verbose_eval=100)
                
# clf.fit(X_train, y_train, early_stopping_rounds=50, eval_metric="mlogloss",
#         eval_set=[(X_train, y_train), (X_val, y_val)])
        
# p=clf.predict(X_val, ntree_limit=clf.best_iteration)
# print(f1_score(y_val,p,average='weighted'))

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
# lr=LogisticRegression(verbose=10,class_weight='balanced',C=5,random_state=1994,n_jobs=-1,intercept_scaling=2)
# lr.fit(X_train,y_train)
# lrpred=lr.predict(X_val)
# print(f1_score(y_val,lrpred,average='weighted'))


## Predicting

In [None]:
# xgb=XGBClassifier()
# xgb.fit(xtn,y_train)
# cbpred=xgb.predict(xts)
# print(f1_score(y_val,cbpred,average='weighted'))

# from sklearn.neural_network import MLPClassifier
# clf = MLPClassifier(verbose=10)
# clf.fit(X_train, y_train)
# y_pred = clf.predict(X_val)
# print(f1_score(y_val,y_pred,average='weighted'))
# Assign back rather than relying on inplace=True on a column selection.
test1['Complaint-reason']=test1['Complaint-reason'].fillna('Other')
# test1['Consumer-complaint-summary'].fillna('Other',inplace=True)

In [None]:
# test=test1.copy()
# test['Date-received']=pd.to_datetime(test['Date-received'])
# test['Date-sent-to-company']=pd.to_datetime(test['Date-sent-to-company'])
# test['diff'] = test['Date-sent-to-company'] - test['Date-received']
# test['diff_days']=test['diff']/np.timedelta64(1,'D')
# test['diff_year']=test['diff']/np.timedelta64(1,'Y')
# test['diff_m']=test['diff']/np.timedelta64(1,'M')
# # test['diff_w']=test['diff']/np.timedelta64(1,'W')
# test['Company-response'].fillna('None',inplace=True)
# test['Consumer-disputes'].fillna('Other',inplace=True)
# test['Consumer-complaint-summary']=test['Consumer-complaint-summary'].apply(clean_text)
# test['Complaint-reason']=test['Complaint-reason'].apply(clean_text)
# test['isSameDay']=test['diff_days'].apply(dateSim)

# test['Complaint-reasonLen']=test['Complaint-reason'].apply(len)
# test['Consumer-complaint-summaryLen']=test['Consumer-complaint-summary'].apply(len)

# test.drop(['Date-sent-to-company','Date-received','diff'],axis=1,inplace=True)
# test.head()
test=test1.copy()

In [None]:
test=pd.get_dummies(test,columns=['Transaction-Type','Company-response','Consumer-disputes'],drop_first=True)
comp_reason_test =vec_cr.transform(test['Complaint-reason'])
consum_comp_sum_test =vec_cs.transform(test['Consumer-complaint-summary'])


comp_reason_testchar =vec_cr_char.transform(test['Complaint-reason'])
consum_comp_sum_testchar =vec_csChar.transform(test['Consumer-complaint-summary'])

# comp_reason_testcharwb =vec_cr_charwb.transform(test['Complaint-reason'])
# consum_comp_sum_testcharwb =vec_csCharwb.transform(test['Consumer-complaint-summary'])
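
Calling `pd.get_dummies` separately on train and test only yields matching columns when both contain the same categories. A hedged sketch of one way to guard against a mismatch, using `DataFrame.reindex`; the toy frames are made up, and this alignment step is an addition rather than part of the original pipeline:

```python
import pandas as pd

# Toy data: the test set lacks the "Student loan" category.
train_toy = pd.DataFrame(
    {"Transaction-Type": ["Mortgage", "Credit card", "Student loan"]}
)
test_toy = pd.DataFrame({"Transaction-Type": ["Mortgage", "Credit card"]})

train_d = pd.get_dummies(train_toy, columns=["Transaction-Type"])
test_d = pd.get_dummies(test_toy, columns=["Transaction-Type"])

# Give the test frame exactly the training columns, in the same order;
# categories unseen in test become all-zero columns.
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)
print(list(test_aligned.columns))
```

With `drop_first=True`, as used in this notebook, the dropped category is chosen per frame, so it may be safer to encode without it (or on the concatenated data) before aligning.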

In [None]:
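# Cast the dense feature columns to float so mixed bool/numeric dtypes stack cleanly.
dense_feats_test = sparse.csr_matrix(test[feats].astype(float).values)
final_features_test = sparse.hstack((dense_feats_test, comp_reason_test, consum_comp_sum_test,comp_reason_testchar,consum_comp_sum_testchar)).tocsr()
final_features_test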

In [None]:
lr=LogisticRegression(verbose=1,class_weight='balanced',C=5,random_state=1994,n_jobs=-1)
lr.fit(final_features,train['Complaint-Status'].values)
lrpred=lr.predict(final_features_test)

In [None]:
# preds=[]
# from sklearn.model_selection import StratifiedKFold
# kf = StratifiedKFold(n_splits=3,random_state=1994,shuffle=True)
# for train_index,test_index in kf.split(X,y):
# #     print('\n{} of kfold {}'.format(i,kf.n_splits))
#     Xtrain,Xtest = X[train_index],X[test_index]
#     ytrain,ytest = y[train_index],y[test_index]
# #     print(Xtrain.shape,Xtest.shape)
# #     print(ytrain.shape,ytest.shape)
#     lr=LogisticRegression(verbose=1,class_weight='balanced',C=5,random_state=1994,n_jobs=-1)
#     lr.fit(Xtrain,ytrain)
#     lrpred=lr.predict(final_features_test)
#     preds.append(lrpred)

In [None]:
# for i in range(len(preds)):
#     s=pd.DataFrame({'Complaint-ID':test['Complaint-ID'],'Complaint-Status':preds[i]})
#     s.to_csv('lrsKfolds'+str(i)+'.csv',index=False)

In [None]:
s=pd.DataFrame({'Complaint-ID':test['Complaint-ID'],'Complaint-Status':lrpred})
s.head()

In [None]:
s.to_csv('lrs13.csv',index=False)

In [None]:
# s['Complaint-Status']=mbpred
# s.to_csv('mbs1.csv',index=False)