# Customer Feedback Analysis for Moroccan Banks

## Data collection

After scraping bank reviews for Moroccan cities from the Apify website, we obtained the raw data in JSON format. The next step is data cleaning: removing errors, handling missing values, and standardizing formats so that the data is accurate, consistent, and ready for analysis. This notebook walks through that cleaning process for the bank reviews data.

## Data cleaning and wrangling

### Cleaning data

In [1]:
# import libraries 
import pandas as pd
import os
import glob
import ast
import numpy as np
import plotly.express as px

In this step, we will combine the data from multiple cities into a single DataFrame for further analysis. The data for each city is stored in separate JSON files. We will read each JSON file, convert it into a pandas DataFrame, and then concatenate all the DataFrames into one.

In [2]:
# Directory path where the JSON files are located
json_dir = '../data/city_data'

# Get a list of all JSON files in the directory
json_files = glob.glob(os.path.join(json_dir, '*.json'))

# Initialize an empty list to store DataFrames
dfs = []

# Iterate over each JSON file
for json_file in json_files:
    # Read the JSON file into a pandas DataFrame
    df = pd.read_json(json_file)
    
    # Append the DataFrame to the list
    dfs.append(df)

# Concatenate all DataFrames into a single DataFrame
combined_df = pd.concat(dfs, ignore_index=True)

# Save the combined DataFrame as a CSV file
combined_df.to_csv('../data/all_cities.csv', index=False)
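
As a small illustration of what the concatenation step does, the sketch below builds two toy per-city DataFrames in memory and combines them. The column values are invented for the example and are not taken from the scraped data. With `ignore_index=True`, the result gets a fresh 0..n-1 index instead of repeating each file's own index.

```python
import pandas as pd

# Two toy per-city frames; the values are invented for illustration only
casablanca = pd.DataFrame({"title": ["Bank A", "Bank B"], "city": ["Casablanca", "Casablanca"]})
rabat = pd.DataFrame({"title": ["Bank C"], "city": ["Rabat"]})

# ignore_index=True discards the per-frame indexes (0, 1 and 0) and renumbers the rows
combined = pd.concat([casablanca, rabat], ignore_index=True)

print(combined.index.tolist())  # [0, 1, 2]
```

Without `ignore_index=True`, the index labels would repeat across cities, which makes later label-based row selection ambiguous.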

In [3]:
df = pd.read_csv('../data/all_cities.csv')



In [4]:
# keep columns that are useful for analysis 
columns = ['title', 'categoryName', 'city', 'location', 'totalScore', 'rank', 'cid', 'publishedAtDate', 'reviewsCount', 'reviewsDistribution', 'text', 'textTranslated',
            'reviewId', 'reviewerId', 'reviewerNumberOfReviews', 'stars']
# take an explicit copy so that later in-place edits do not raise SettingWithCopyWarning
df_bank = df[columns].copy()

In [5]:
# drop rows with missing values in title column
df_bank.dropna(subset=['title'], inplace=True)

In [6]:
# import map_labels function from utils.py
from utils import map_labels

In [7]:
# use map_labels on df_bank['title'] to create a new column 'bank'
df_bank['bank'] = df_bank['title'].map(map_labels(df_bank['title']))


In [8]:
# drop rows where bank is 'unknown', then drop the 'title' column
df_bank = df_bank[df_bank['bank'] != 'unknown']
df_bank = df_bank.drop(columns=['title'])

We read the CSV file into a new DataFrame and selected the relevant columns for analysis. We dropped rows with missing values in the "title" column and mapped the labels in the "title" column to create a new "bank" column. Rows with "bank" values of "unknown" were removed, and the "title" column was dropped.
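
The `map_labels` helper lives in `utils.py`, which is not shown in this notebook. As a rough idea of how a title-to-bank mapping of this kind might work, here is a minimal sketch. It assumes that the helper returns a dict from each distinct title to a bank label, and that labels are found by keyword matching. The function name `map_labels_sketch` and the keyword table are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical keyword table: these patterns are illustrative assumptions,
# not the contents of the project's utils.py
BANK_KEYWORDS = {
    "Attijariwafa Bank": r"attijari",
    "Banque Populaire": r"banque populaire|chaabi",
    "CIH Bank": r"\bcih\b",
}


def map_labels_sketch(titles):
    """Return a dict mapping each distinct title to a bank label or 'unknown'."""
    mapping = {}
    for title in set(titles):
        label = "unknown"
        for bank, pattern in BANK_KEYWORDS.items():
            # Case-insensitive search so 'CIH BANK' and 'Cih bank' both match
            if re.search(pattern, str(title), flags=re.IGNORECASE):
                label = bank
                break
        mapping[title] = label
    return mapping


titles = ["Attijariwafa bank Agence Maarif", "CIH BANK - Agdal", "Pharmacie Centrale"]
labels = map_labels_sketch(titles)
print(labels["Pharmacie Centrale"])  # unknown
```

A dict built this way can be passed to `Series.map`, which matches how `map_labels` is called above. Titles with no keyword match get the label 'unknown', the value that the filtering step removes.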

In [9]:
# put bank column in the first position
cols = df_bank.columns.tolist()
cols = cols[-1:] + cols[:-1]
df_bank = df_bank[cols]

In [10]:
# Convert the 'location' column from a string to a dictionary
df_bank['location'] = df_bank['location'].apply(ast.literal_eval)

# Extract 'lat' from the 'location' column
df_bank['lat'] = df_bank['location'].apply(lambda x: x['lat'])

# Extract 'lng' from the 'location' column
df_bank['lng'] = df_bank['location'].apply(lambda x: x['lng'])

In [11]:
# drop location column
df_bank.drop(columns=['location'], inplace=True)

To make the location data more usable, we converted the "location" column from a string format to a dictionary format. From the dictionary, we extracted the latitude ("lat") and longitude ("lng") values and created new columns. The original "location" column was then dropped.
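
To see why the conversion is needed, it helps to look at a single value. After the round trip through CSV, each location is stored as the text representation of a Python dict rather than as a dict. The sketch below parses one such string. The coordinates are made up for the example.

```python
import ast

# Sample value as it might appear after the CSV round trip; the coordinates are made up
raw_location = "{'lat': 33.5731, 'lng': -7.5898}"

# literal_eval only parses Python literals, so it does not execute arbitrary code
location = ast.literal_eval(raw_location)

lat, lng = location["lat"], location["lng"]
print(type(location).__name__, lat, lng)
```

`json.loads` would reject this string because it uses single quotes, which is why `ast.literal_eval` is a natural fit here.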

In [12]:
# Clean the "publishedAtDate" column: keep only the date part (before 'T')
df_bank['publishedAtDate'] = df_bank['publishedAtDate'].apply(lambda x: str(x).split('T')[0] if not pd.isnull(x) else np.nan)
# parse as datetime; values that cannot be parsed become NaT
df_bank['publishedAtDate'] = pd.to_datetime(df_bank['publishedAtDate'], errors='coerce')

In [13]:
# drop rows whose date is missing or could not be parsed
df_bank.dropna(subset=['publishedAtDate'], inplace=True)

In [14]:
# replace NaN values in 'textTranslated' column with 'text' column
df_bank['textTranslated'] = df_bank['textTranslated'].fillna(df_bank['text'])

In [15]:
# drop text column
df_bank.drop('text', axis=1, inplace=True)

We also cleaned the "publishedAtDate" column by keeping only the date part and converting it to a datetime, and dropped the rows without a valid date. NaN values in the "textTranslated" column were replaced with values from the "text" column, and the "text" column was subsequently dropped.
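
The effect of these steps is easier to see on a few rows. The sketch below applies the same operations to a toy DataFrame. The timestamps and review texts are invented for illustration and do not come from the scraped data.

```python
import numpy as np
import pandas as pd

# Toy rows; timestamps and review texts are invented for illustration
reviews = pd.DataFrame({
    "publishedAtDate": ["2023-05-14T09:30:00.000Z", "2022-11-02T18:05:12.000Z", "not a date"],
    "text": ["Service rapide", "Good staff", "ok"],
    "textTranslated": ["Fast service", None, None],
})

# Keep the part before 'T', then parse; unparseable values become NaT
date_part = reviews["publishedAtDate"].apply(
    lambda x: str(x).split("T")[0] if not pd.isnull(x) else np.nan
)
reviews["publishedAtDate"] = pd.to_datetime(date_part, errors="coerce")

# Fall back to the original text where no translation exists
reviews["textTranslated"] = reviews["textTranslated"].fillna(reviews["text"])

# Drop rows whose date could not be parsed, then drop the now redundant 'text' column
reviews = reviews.dropna(subset=["publishedAtDate"]).drop(columns=["text"])

print(reviews)
```

In this toy example the third row is removed because its date cannot be parsed, and the second row keeps its original text because no translation was available.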

In [17]:
fig = px.scatter_mapbox(df_bank, lat="lat", lon="lng", hover_name="bank", hover_data=["stars"])
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r": 0, "t": 0, "l": 0, "b": 0})
fig.show()

In [18]:
# save df_bank to csv file 
df_bank.to_csv('../data/all_cities_cleaned.csv', index=False)

Overall, these steps have prepared our data for further analysis and exploration in our customer feedback analysis platform for Moroccan banks. Well done on completing the data cleaning and wrangling process!