# Overview

In AI/ML, data must be structured clearly and consistently before it can be used to train models. This matters in every field, but especially in the FinTech industry, where customer feedback is complex and varied. The dataset used here consists of customer reviews in multiple languages.

# Methodology

To structure a dataset, we break it down into input features and the corresponding output values. The entire dataset is denoted by *D*, and its core components are:

*   The total number of examples (m)
*   The input variables used for prediction (x)
*   The target variables to be predicted (y)

Using this notation, the dataset *D* can be expressed as:

$$D = \{\langle x_{1}, y_{1}\rangle, \langle x_{2}, y_{2}\rangle, \ldots, \langle x_{m}, y_{m}\rangle\}$$

This structure applies to practically any labeled dataset.
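
As a minimal sketch of this notation, the snippet below builds *D* as a list of (x, y) pairs from a few made-up rows. The rows and the helper name `build_dataset` are illustrative assumptions and do not come from the actual dataset.

```python
from typing import List, Tuple

# Hypothetical toy rows: (review_text, user_age, num_helpful_votes, rating)
rows = [
    ("Great app", 18, 209, 4.5),
    ("Too many ads", 40, 12, 2.0),
    ("Works fine", 67, 3, 3.5),
]

Example = Tuple[tuple, tuple]


def build_dataset(rows, n_inputs: int) -> List[Example]:
    """Split each row into an input tuple x and a target tuple y."""
    return [(tuple(r[:n_inputs]), tuple(r[n_inputs:])) for r in rows]


# The first two fields are inputs, the remaining fields are targets
D = build_dataset(rows, n_inputs=2)
m = len(D)  # total number of examples

for x_i, y_i in D:
    print("<", x_i, ",", y_i, ">")
```

Each element of `D` is one pair, so `len(D)` gives m directly.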

# Multilingual Data Handling

The dataset we have used contains text in multiple languages, which is a common challenge in Natural Language Processing (NLP). For a model to understand such text, we either need multilingual techniques or we can translate all the text into a single language.

However, the language of the text changes neither the fundamental structure of the dataset nor the notation. Hence, the "Review" column is treated as a normal input feature.

Data annotation is crucial for multilingual machine learning models. It is carried out through annotation techniques such as:

*   Named Entity Recognition (NER): This involves identifying and categorizing key information in the text, such as names of people, organisations and locations.

*   Sentiment Analysis: Annotating text with sentiment labels allows a model to gauge the opinions and emotions expressed in different languages.

*   Text Classification: This can be used to categorize reviews by language and to gauge reactions by culture, so that the experience can be tailored to a particular region or language.

Some steps for handling multilingual text include:

*   Encoding: Store and process text in **UTF-8**, which can represent the scripts of virtually all of the world's languages.

*   Language Detection: If the data does not already come with language labels, tools like **langdetect**, **fastText** or **langid** can be used to separate the languages.

*   Tokenization: Languages such as Hindi, English and Spanish separate words with spaces, while Japanese and Chinese do not. Language-specific tokenizers such as spaCy or IndicNLP can handle these differences.

*   Normalization and Cleaning: Remove punctuation, special characters, emojis etc. as needed. Stopwords (very common words that carry little meaning) can also be handled separately for each language.
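
The normalization and tokenization steps above can be sketched with the Python standard library alone. This is a rough illustration and not a replacement for a language-specific tokenizer: the function names are hypothetical, and the CJK check covers only a few Unicode blocks, which is an assumption made to keep the example short.

```python
import unicodedata


def normalize(text: str) -> str:
    """Apply NFKC normalization, lowercase, and drop punctuation and symbols."""
    text = unicodedata.normalize("NFKC", text).lower()
    kept = []
    for ch in text:
        # Categories starting with "P" are punctuation and "S" are symbols
        # (most emoji fall under "So"); replace them with a space
        if unicodedata.category(ch)[0] in ("P", "S"):
            kept.append(" ")
        else:
            kept.append(ch)
    # Collapse runs of whitespace into single spaces
    return " ".join("".join(kept).split())


def is_cjk(ch: str) -> bool:
    """Crude check: CJK Unified Ideographs, Hiragana and Katakana blocks only."""
    code = ord(ch)
    return 0x4E00 <= code <= 0x9FFF or 0x3040 <= code <= 0x30FF


def tokenize(text: str) -> list:
    """Whitespace tokens for spaced scripts; one token per character for CJK."""
    tokens = []
    for chunk in normalize(text).split():
        if any(is_cjk(ch) for ch in chunk):
            tokens.extend(chunk)  # split the chunk into single characters
        else:
            tokens.append(chunk)
    return tokens


print(tokenize("Great app!!! 👍"))
print(tokenize("とても良い。"))
```

Character-level splitting is only a fallback for scripts without spaces; a dedicated tokenizer would produce word-level tokens instead.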

# FinTech Applications

In FinTech applications, this is probably the most widely used way of describing a dataset.

The dataset that we have chosen can be used to analyze customer feedback and predict user satisfaction and engagement.

A FinTech company could use this data to train a regression model that relates a user's age and review text to the rating they give. This could then allow the company to identify problems and improve its customer service.
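
As a rough, one-feature illustration of such a regression, the sketch below fits a straight line to rating as a function of age using ordinary least squares. The ages and ratings are made-up values and `fit_line` is a hypothetical helper; review text would first have to be converted into numeric features before it could be used in the same way.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy data, not taken from the real dataset
ages = [20, 30, 40, 50]
ratings = [4.0, 3.5, 3.0, 2.5]

slope, intercept = fit_line(ages, ratings)
print(f"rating = {slope:.3f} * age + {intercept:.3f}")
```

A negative slope in this toy data would mean that older users tend to give lower ratings; on real data the fit should be checked before drawing any conclusion.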

Other FinTech applications include:

*   Fraud Detection: Labeling transaction data as "fraud" or "scam" helps models become better at detecting fraudulent activity.

*   Personalised Customer Experience: By annotating customer experiences, support tickets, and feedback with sentiment labels (positive, negative and neutral) or intent labels (account inquiry, transaction dispute), companies can build models that provide a personalized experience for customers through chatbots.

*   Churn Prediction: Predict which users are likely to disengage based on reviews and ratings.

*   Localized Product Insights: Compare feedback across languages/countries to tailor financial services.
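
To illustrate the sentiment labels and the per-language comparison mentioned above, the sketch below maps a numeric rating to a label and counts labels per language. The thresholds, the 1 to 5 rating scale, the language codes and the sample reviews are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def sentiment_label(rating: float, low: float = 2.0, high: float = 4.0) -> str:
    """Map a rating to a coarse sentiment label using hypothetical thresholds."""
    if rating >= high:
        return "positive"
    if rating <= low:
        return "negative"
    return "neutral"


# Hypothetical (language, rating) pairs
reviews = [("en", 4.5), ("en", 1.5), ("es", 3.0), ("ja", 4.0), ("ja", 1.0)]

# Count sentiment labels separately for each language
by_language = defaultdict(Counter)
for language, rating in reviews:
    by_language[language][sentiment_label(rating)] += 1

for language, counts in sorted(by_language.items()):
    print(language, dict(counts))
```

Counts like these could feed a localized dashboard, while the labels themselves could serve as targets for a sentiment classifier.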

# Citations

*   https://learningspiral.ai/data-annotation-for-financial-services-driving-intelligent-decision-making/#:~:text=Personalized%20Customer%20Service%3A%20Sentiment%20analysis,Difficulties%20in%20the%20Path

*   https://macgence.com/blog/banking-data-annotation-solutions/

*   https://activeloc.com/blog/language-data-annotation-in-multilingual-ai/

*   https://medium.com/@thakrandisharth/tackling-global-data-challenges-with-multilingual-data-processing-feb2620d3000

*   https://sigma.ai/what-is-data-annotation/#:~:text=Through%20sentiment%20annotation%2C%20the%20model,based%20on%20their%20subject%20matter.

# Implementation

The code calculates the total number of training examples (m) by counting the rows of the dataset.

We have chosen the "review_text" and "user_age" columns as input variables, while "num_helpful_votes" and "rating" are the target variables. With this split, the company can analyse its products and services based on "review_text" and "user_age", which helps it understand how reviews differ across age groups. The "rating" and "num_helpful_votes" of a review indicate how the reviewer scored the product and how many people support the review, so the company can act accordingly, for example by improving quality.

Then, the code iterates over the rows to print the entire dataset in the notation described earlier.

Some of the text in "review_text" may appear garbled because it is in languages whose fonts could not be loaded.


In [None]:
#Load suitable libraries
import pandas as pd
import numpy as np

import kagglehub
from kagglehub import KaggleDatasetAdapter

file_path = "multilingual_mobile_app_reviews_2025.csv"

print(f"Loading '{file_path}' from Kaggle Hub...")

df = kagglehub.load_dataset(
    KaggleDatasetAdapter.PANDAS,
    "pratyushpuri/multilingual-mobile-app-reviews-dataset-2025",
    file_path,
)

print("Dataset loaded successfully.")


#No. of training examples
m = df.shape[0]

#X is a DataFrame holding the 2 input features
X = df[['review_text','user_age']]

#Y is a DataFrame holding the 2 target features
Y = df[['num_helpful_votes','rating']]

#Print the no. of training examples
print("\nNo. of training examples : ",m)

#Print the input features
print("\nX:")
print(X)

#Print the target features
print("\nY:")
print(Y)

#Print the dataset D in the format D = {<x1, y1>, ..., <xm, ym>}
print("\nDataset (D) =")
for i in range(m):
  # .iloc[i] selects row i by position, independent of the index labels
  print("<", tuple(X.iloc[i].values.tolist()), ",", tuple(Y.iloc[i].values.tolist()), ">")

Loading 'multilingual_mobile_app_reviews_2025.csv' from Kaggle Hub...


Dataset loaded successfully.

No. of training examples :  2514

X:
                                            review_text  user_age
0     Qui doloribus consequuntur. Perspiciatis tempo...      14.0
1     Great app but too many ads, consider premium v...      18.0
2     The interface could be better but overall good...      67.0
3     Latest update broke some features, please fix ...      66.0
4     Perfect for daily use, highly recommend to eve...      40.0
...                                                 ...       ...
2509  Счастье низкий пастух. Нож неожиданно поезд тр...      21.0
2510  This app is amazing! Really love the new featu...      38.0
2511  This app is amazing! Really love the new featu...      27.0
2512  Invitare convincere pericoloso corsa fortuna. ...      35.0
2513  Latest update broke some features, please fix ...      26.0

[2514 rows x 2 columns]

Y:
      num_helpful_votes  rating
0                    65     1.3
1                   209     1.6
2               

# Conclusion

In short, properly organizing our data is a crucial first step for AI/ML. By labeling the dataset, counting the examples (m), and separating the inputs (x) from the outputs (y), we change messy data into a clear and concise form. This is the basic foundation required for all future work, like training a model. This method helps FinTech companies use their data to understand customer feedback and make better decisions.
