## Project 1 - Claims Fraud Detection using NLP

Project Idea: Build a model to detect fraudulent insurance claims by analyzing the textual content of claim reports.

Steps:

    Collect and preprocess claim data.
    Use NLP to extract features from the textual descriptions.
    Train a machine learning model to classify claims as fraudulent or not.
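
Step 2 is where the claim text becomes numbers. As a rough illustration of what TF-IDF weighting does, here is a minimal pure-Python sketch. It assumes simple whitespace tokenization with punctuation stripped and no stop-word removal, and it uses the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which matches scikit-learn's documented defaults. The helper names `tokenize` and `tfidf` are made up for this example.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace, stripping surrounding punctuation."""
    tokens = (word.strip(".,!?;:\"'") for word in text.lower().split())
    return [t for t in tokens if t]

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (assumed formula, see lead-in)
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {term: tf * idf[term] for term, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({term: w / norm for term, w in raw.items()} if norm else {})
    return vectors

claims = [
    "The car was stolen while parked in a secure area.",
    "The laptop was stolen from my office desk.",
    "My house was flooded due to a burst pipe.",
]
vectors = tfidf(claims)
# A term shared by every claim ("was") is weighted lower than a term unique to one claim ("car")
print(vectors[0]["was"] < vectors[0]["car"])
```

Rare terms end up with the largest weights, which is why a classifier trained on these features can pick up on distinctive wording in a claim.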

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Sample data
data = {
    'claim_id': [1, 2, 3, 4, 5],
    'claim_text': [
        "The car was stolen while parked in a secure area.",
        "I slipped and fell at work, breaking my arm.",
        "My house was flooded due to a burst pipe.",
        "I was involved in a minor car accident, no injuries.",
        "The laptop was stolen from my office desk."
    ],
    'fraudulent': [0, 0, 1, 0, 1]
}
df = pd.DataFrame(data)

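# Split the raw text first so the vectorizer never sees the test claim
X_train_text, X_test_text, y_train, y_test = train_test_split(
    df['claim_text'], df['fraudulent'], test_size=0.2, random_state=42
)

# Extract TF-IDF features: fit on the training text only, then reuse that vocabulary
vectorizer = TfidfVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

# Train model (fixed seed so repeated runs build the same forest)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Predict and evaluate (with five rows the test set is a single claim)
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(classification_report(y_test, y_pred, zero_division=0))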



A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.0.0 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\NSS\AppData\Roaming\Python\Python312\site-packages\ipykernel_launcher.py", line 18, in <module>
    app.launch_new_instance()
  File "C:\Users\NSS\AppData\Roaming\Python\Python312\site-packages\traitlets\config\application.py", line 1075, in launch_instance
    app.start()
  File "C:\Users\NSS\AppData\Roaming\Python\Python312\site-packages\ipykernel\kernelapp.py", line 739, in start
    self.io_loop.sta

AttributeError: _ARRAY_API not found


ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
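
The output above points to an environment problem rather than a bug in the cell: at least one compiled module was built against NumPy 1.x while NumPy 2.0.0 is installed. The message itself suggests downgrading to `numpy<2` or upgrading the affected module. As a hedged diagnostic sketch, the installed versions can be read from package metadata without importing the packages, since importing is what triggers the error. The package list and the helper names `installed_versions` and `numpy_major` are assumptions for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Return {package: version string or None} without importing the packages."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

def numpy_major(versions):
    """Major version of NumPy as an int, or None if NumPy is not installed."""
    numpy_version = versions.get("numpy")
    return int(numpy_version.split(".")[0]) if numpy_version else None

# Assumed list: the packages this notebook imports plus NumPy itself
versions = installed_versions(["numpy", "pandas", "scikit-learn"])
for name, ver in versions.items():
    print(f"{name}: {ver or 'not installed'}")

major = numpy_major(versions)
if major is not None and major >= 2:
    print("NumPy 2.x detected: upgrade the packages above or pin numpy<2.")
```

Whichever route is taken, the kernel needs a restart afterwards so the new binaries are actually loaded.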

## Project 2 - Customer Sentiment Analysis for Insurance Products

Project Idea: Analyze customer reviews or feedback about insurance products to understand sentiment and improve services.

Steps:

    Collect customer feedback data.
    Use NLP to preprocess and analyze sentiment.
    Visualize the sentiment trends over time.

In [None]:
import nltk
import pandas as pd
import matplotlib.pyplot as plt
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# VADER needs its lexicon downloaded once before SentimentIntensityAnalyzer() will work
nltk.download('vader_lexicon')

# Sample data
data = {
    'review_id': [1, 2, 3, 4, 5],
    'review_text': [
        "Great customer service and quick claims processing.",
        "Terrible experience, will not recommend.",
        "Satisfactory service, but could be better.",
        "Excellent coverage and support.",
        "Horrible, denied my claim without reason."
    ]
}
df = pd.DataFrame(data)

# Score each review with VADER's compound polarity (-1 = most negative, +1 = most positive)
sid = SentimentIntensityAnalyzer()
df['sentiment'] = df['review_text'].apply(lambda x: sid.polarity_scores(x)['compound'])

# Visualize sentiment
plt.figure(figsize=(10, 6))
df['sentiment'].hist(bins=10)
plt.title('Customer Sentiment Distribution')
plt.xlabel('Sentiment Score')
plt.ylabel('Frequency')
plt.show()
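
The histogram shows how the scores are distributed, while step 3 asks for trends over time. The sample reviews carry no dates, so the sketch below uses hypothetical `(date, compound score)` pairs purely for illustration and averages them per month with the standard library. The `label` helper uses the ±0.05 cut-offs commonly recommended for VADER's compound score. Treat the thresholds, the helper names, and the data as assumptions.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

def label(compound, threshold=0.05):
    """Map a compound score to a coarse sentiment label (assumed cut-offs)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

def monthly_mean(scored):
    """Average the scores per (year, month), returned in chronological order."""
    buckets = defaultdict(list)
    for day, score in scored:
        buckets[(day.year, day.month)].append(score)
    return {month: mean(scores) for month, scores in sorted(buckets.items())}

# Hypothetical dated scores; real data would pair each review with its timestamp
scored_reviews = [
    (date(2024, 1, 5), 0.8),
    (date(2024, 1, 20), -0.6),
    (date(2024, 2, 3), 0.0),
    (date(2024, 2, 17), 0.5),
]

trend = monthly_mean(scored_reviews)
for (year, month), avg in trend.items():
    print(f"{year}-{month:02d}: {avg:+.2f} ({label(avg)})")
```

The resulting monthly averages can be passed to `plt.plot` in place of the histogram to draw the trend line.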


## Project 3 - Policy Document Classification

Project Idea: Classify insurance policy documents into different categories using NLP techniques.

Steps:

    Collect policy documents and label them.
    Preprocess the text data.
    Train a machine learning model to classify the documents.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Sample data
data = {
    'policy_id': [1, 2, 3, 4, 5],
    'policy_text': [
        "This policy covers health insurance with annual check-ups.",
        "The policy provides auto insurance with comprehensive coverage.",
        "Home insurance policy covering natural disasters.",
        "Travel insurance policy including medical emergencies.",
        "Life insurance policy with a term of 20 years."
    ],
    'category': ['Health', 'Auto', 'Home', 'Travel', 'Life']
}
df = pd.DataFrame(data)

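# Split the raw text first so the vectorizer is fit on training documents only
X_train_text, X_test_text, y_train, y_test = train_test_split(
    df['policy_text'], df['category'], test_size=0.2, random_state=42
)

# Extract TF-IDF features using the training vocabulary
vectorizer = TfidfVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

# Train model
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict and evaluate (each category appears once here, so the held-out
# document's class is unseen in training; use more data for a meaningful score)
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(classification_report(y_test, y_pred, zero_division=0))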
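
With one document per category, any held-out document belongs to a class the model never saw during training, so the accuracy on this sample is necessarily 0. A small sanity check before splitting, sketched here with the standard library, flags categories that have too few examples to appear on both sides of a split. The function name `underrepresented` and the minimum of two examples per class are illustrative assumptions.

```python
from collections import Counter

def underrepresented(labels, min_per_class=2):
    """Return {label: count} for classes with fewer than min_per_class examples."""
    counts = Counter(labels)
    return {name: n for name, n in counts.items() if n < min_per_class}

# Labels from the sample data above
categories = ['Health', 'Auto', 'Home', 'Travel', 'Life']

too_few = underrepresented(categories)
if too_few:
    print(f"Need more labeled documents for: {sorted(too_few)}")
```

Once every category has several documents, passing `stratify=y` to `train_test_split` keeps the class proportions the same in the training and test sets.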
