## Basic text cleaning and string matching notebook
This notebook is my own take on the two notebooks below. I tried not to copy too much from them because I wanted to work through the problem on my own, but the concept is basically the same.

* https://www.kaggle.com/manabendrarout/tabular-data-preparation-basic-eda-and-baseline
* https://www.kaggle.com/josephassaker/coleridge-initiative-eda-na-ve-submission

I did this as a basic exercise in Python/pandas text manipulation and text matching, so that I could get a handle on the data and set up the problem.

In [None]:
# You don't need much! Not even numpy.
import pandas as pd
import re  # the standard library module handles the pattern used here
import glob

In [None]:
train_df = pd.read_csv('../input/coleridgeinitiative-show-us-the-data/train.csv')
test_files_path = '../input/coleridgeinitiative-show-us-the-data/test'

# Note that there's no need for the training JSON files in this simple notebook.
#train_files_path = '../input/coleridgeinitiative-show-us-the-data/train'

# There's also no need for the sample submission CSV, as the submission is generated from the test data.
# This is a different approach from the notebooks linked at the top, and I think a bit more realistic.
#sample_sub = pd.read_csv('../input/coleridgeinitiative-show-us-the-data/sample_submission.csv')

# Keep the preview as the last expression so the cell actually displays it
train_df.head()

# Get the names of unique datasets
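Take the unique dataset titles from the training data and normalize them with the given text cleaning function: lowercase everything, replace each run of non-alphanumeric characters with a single space, and trim the ends.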
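
To make the effect of the cleaning step concrete, here is a small standalone sketch. The helper name `clean_text` and the sample title are made up for illustration; only the regular expression comes from this notebook.

```python
import re

def clean_text(txt):
    """Lowercase, turn runs of non-alphanumeric characters into one space, trim."""
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower()).strip()

# Hypothetical title, not taken from the competition data
sample_title = "Example Survey of Widgets (ESW-2020)"
cleaned_title = clean_text(sample_title)
print(cleaned_title)  # example survey of widgets esw 2020
```

Punctuation and case differences disappear, which is what lets a title match text where it was written slightly differently.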

In [None]:
# Get a list of unique dataset titles from the training set and put them into a new df
df_unique_data_sets = pd.DataFrame(train_df['dataset_title'].unique().tolist(), columns=['data_sets'])

# Run the dataset titles through the official text cleaning function.
# I converted the function to a lambda because I think it's a little nicer
# and I also wanted some practice working with lambdas.
# Applying it to the single column avoids a slower row-by-row apply over the whole frame.
df_unique_data_sets['datasets_cleaned'] = df_unique_data_sets['data_sets'].apply(
    lambda txt: re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower()).strip())

df_unique_data_sets.head()

# Get all text from publications
Get the list of JSON files using glob, loop through all of them, and collect their contents in a single pandas DataFrame.
Then run the article text through the cleaning function, so that it has a better chance of matching the cleaned dataset titles later in the notebook.
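
Here is a self-contained sketch of that loading pattern on two tiny stand-in files written to a temporary folder. The file layout (a JSON list of records with `section_title` and `text` keys) and the file names are assumptions made for this example, not something read from the real test folder.

```python
import glob
import json
import os
import tempfile

import pandas as pd

# Write two small stand-in publication files (assumed layout, made-up content)
tmp_dir = tempfile.mkdtemp()
samples = {
    "pub_a": [
        {"section_title": "Intro", "text": "We use the Example Survey."},
        {"section_title": "Methods", "text": "Details here."},
    ],
    "pub_b": [{"section_title": "Abstract", "text": "Nothing relevant."}],
}
for pub_id, sections in samples.items():
    with open(os.path.join(tmp_dir, pub_id + ".json"), "w") as fh:
        json.dump(sections, fh)

frames = []
for path in sorted(glob.glob(os.path.join(tmp_dir, "*.json"))):
    frame = pd.read_json(path)  # one row per record in the file
    # The file name without its extension becomes the publication id
    frame.insert(0, "pub_id", os.path.splitext(os.path.basename(path))[0])
    frames.append(frame)

# Concatenate once at the end instead of growing a DataFrame inside the loop
publications = pd.concat(frames, ignore_index=True)
print(publications)
```

Each publication contributes one row per section, so the same `pub_id` can appear several times.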

In [None]:
# Get a list of all JSON files in the test data set
test_files = glob.glob("../input/coleridgeinitiative-show-us-the-data/test/*.json")

frames = []  # one DataFrame per publication, concatenated once after the loop
for test_file in test_files:
    file_data = pd.read_json(test_file)  # read the JSON from the test file

    # Parse the file name (without path or extension) into the pub_id column
    file_data.insert(0, 'pub_id', test_file.split('/')[-1].split('.')[0])

    frames.append(file_data)

# Concatenating once is much cheaper than growing a DataFrame on every pass
df_test_publications = pd.concat(frames, ignore_index=True)

df_test_publications.head()

In [None]:
# Run all text in the test publications DF through the text cleaning function
df_test_publications['text_cleaned'] = df_test_publications['text'].apply(
    lambda txt: re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower()).strip())

df_test_publications.head()

# Match dataset titles with text
The block here does an Excel VLOOKUP-style string match. The loop goes through each unique dataset title and looks for publications whose cleaned text contains it.
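
A minimal sketch of the matching idea on toy data. The column names mirror this notebook, but the titles and texts are made up, and the sketch uses a plain substring search (`regex=False`), which is my assumption about the simplest way to do the lookup.

```python
import pandas as pd

# Made-up cleaned publication text, one row per section
publications = pd.DataFrame({
    "pub_id": ["pub_a", "pub_a", "pub_b"],
    "text_cleaned": [
        "we use the example survey of widgets in this study",
        "details of the method",
        "this paper relies on the sample census of gadgets",
    ],
})
# Made-up cleaned dataset titles
titles = ["example survey of widgets", "sample census of gadgets", "unused data set"]

matches = []
for title in titles:
    # True for every row whose text contains the title as a literal substring
    mask = publications["text_cleaned"].str.contains(title, regex=False)
    matches.append(publications.loc[mask, ["pub_id"]].assign(PredictionString=title))

result = pd.concat(matches, ignore_index=True)
print(result)
```

A title that appears nowhere simply contributes no rows to the result.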

In [None]:
# This loop goes through all unique dataset titles and looks for them in the publications.
# It uses the versions of both the datasets and publications that have been put through the cleaning function.
matches = []  # one small df per dataset title, concatenated after the loop

for title in df_unique_data_sets['datasets_cleaned']:

    # Rows of publications whose cleaned text contains this title.
    # regex=False does a plain substring search, so there are no regex warnings to suppress.
    mask = df_test_publications['text_cleaned'].str.contains(title, regex=False)

    # assign() returns a new df, which avoids the "setting with copy" warning.
    # The new column holds the cleaned dataset title that matched.
    dfx = df_test_publications.loc[mask, ['pub_id']].assign(PredictionString=title)
    matches.append(dfx)

# DataFrame.append was removed in pandas 2.0, so build df_sub with pd.concat
df_sub = pd.concat(matches).rename(columns={"pub_id": "Id"})
df_sub = df_sub.sort_values('PredictionString')  # contest requires alphabetical order
df_sub.head(20)

# Format predictions for submission
Above we have all the predictions, but the contest wants each line in the submission to have only one Id, with all of its datasets separated by a bar (`|`).

Also in the resulting DF above there is a lot of redundancy that we can easily strip out.
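
One possible way to do this collapse, sketched on made-up rows rather than the real matches: drop duplicate rows, sort, group by `Id`, and join each group's titles with a bar. This is only an alternative illustration; the values are hypothetical.

```python
import pandas as pd

# Made-up match rows with a duplicate, standing in for the real predictions
matches = pd.DataFrame({
    "Id": ["pub_a", "pub_a", "pub_a", "pub_b"],
    "PredictionString": [
        "sample census of gadgets",
        "example survey of widgets",
        "example survey of widgets",
        "sample census of gadgets",
    ],
})

submission = (
    matches.drop_duplicates()                 # strip the redundant rows
    .sort_values(["Id", "PredictionString"])  # alphabetical within each Id
    .groupby("Id")["PredictionString"]
    .agg("|".join)                            # bar-separated string per Id
    .reset_index()
)
print(submission)
```

The result has one row per `Id`, with the duplicate title for `pub_a` removed.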

In [None]:
# Last step is to put the prediction strings together
unique_publications = df_sub['Id'].unique().tolist()

id_pred_dict={}
for n in unique_publications:
    dfxx = df_sub[df_sub['Id']== n] # filter on Id each pass of the loop
    unique_preds = dfxx['PredictionString'].unique().tolist() # pull out a unique list from the filtered DF
    strung = '|'.join(unique_preds) # make a string with bar concatenation
    id_pred_dict[n] = strung

# Name the column 'Id' (capitalized) so it matches the column renamed earlier
df_sub1 = pd.DataFrame(list(id_pred_dict.items()), columns=['Id', 'PredictionString'])
df_sub1

In [None]:
# and finally, put it in a CSV for submission
df_sub1.to_csv('submission.csv', index=False)