# NLP PROJECT 2

## Project: Data Exploration and NLP Modeling
### Deadline: January 22, 2024
### Scraping and data exploration

In this second project, you will be tasked with preparing a database by collecting information from various sources, such as customer review sites, articles on cybersecurity, etc.
You can scrape the following websites:

- https://fr.trustpilot.com/
- Yelp (with the API)
- https://www.opinion-assurances.fr/

You can also propose another website (please validate it with your teacher).

You can also use this dataset: https://drive.google.com/file/d/1_kg5JzAzntzLI6eGM3_vmUSoeWk7f8ip/view?usp=sharing


You will then undertake an initial exploration of the data, including data cleaning, visualization, and the production of initial conclusions. This step is crucial to establish a solid foundation for subsequent projects.
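As a starting point for the cleaning and frequent-word steps, here is a minimal sketch that uses only the standard library. The `clean_text` and `top_ngrams` helpers, the tiny stop-word set, and the sample reviews are illustrative assumptions, not part of the assignment; a real project would likely rely on a fuller stop-word list and a proper tokenizer.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set (assumption); replace it with a fuller list.
STOP_WORDS = {"the", "a", "an", "and", "was", "is", "were", "but", "it", "of", "to"}

def clean_text(text):
    """Lowercase the text, keep runs of letters (with inner apostrophes), drop stop words."""
    tokens = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]

def top_ngrams(token_lists, n=1, k=10):
    """Count n-grams over all tokenized documents and return the k most frequent."""
    counts = Counter()
    for tokens in token_lists:
        # zip over n shifted copies of the token list yields consecutive n-tuples
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(k)

# Made-up sample reviews, for illustration only
sample_reviews = [
    "The food was great, the service was slow!",
    "Great food and great wine.",
    "Service was slow but the food was great.",
]
tokenized = [clean_text(review) for review in sample_reviews]

# Note: n-grams are built after stop-word removal, so a pair may span a removed word.
print(top_ngrams(tokenized, n=1, k=3))
print(top_ngrams(tokenized, n=2, k=3))
```

The same two helpers can be applied to a DataFrame column of review texts, and the resulting counts fed to a bar chart or word cloud for the visualization step.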


### Supervised learning

The second phase uses both supervised and unsupervised learning. You will build a supervised text-processing model using NLP techniques. In addition to modeling, you will develop an interactive application where users can submit text on the chosen theme and receive a prediction, along with an explanation of that prediction. This lets you apply your NLP skills in practice.

- Examples of supervised tasks
    - Sentiment analysis
    - Number of stars
    - Categories/subjects
        - for insurance reviews, for example: claims, subscriptions, etc.
        - for restaurants: type of dish, service vs. food
- Streamlit applications
    - For one review, give detailed information (food, service, etc.), cf. Amazon reviews
    - For one restaurant, a summary of this restaurant with more detailed criteria
    - For restaurants, a QA system: “I want very good and expensive sushi”, “restaurant rapide et bon” (“fast and good restaurant”)
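For the sentiment-analysis task, one common way to obtain supervised labels is to derive them from the star ratings. The sketch below assumes a 1 to 5 rating scale and an arbitrary choice of thresholds; `rating_to_sentiment` is a hypothetical helper, and the cut-offs should be adjusted to your data.

```python
from collections import Counter

def rating_to_sentiment(rating, binary=False):
    """Map a 1-5 star rating to a sentiment label.

    Thresholds are an assumption: up to 2 stars is negative, 4 stars and
    above is positive, anything in between is neutral. In binary mode,
    neutral reviews return None so that they can be dropped.
    """
    if not 1 <= rating <= 5:
        raise ValueError(f"rating out of range: {rating}")
    if rating <= 2:
        return "negative"
    if rating >= 4:
        return "positive"
    return None if binary else "neutral"

# Made-up ratings, for illustration only
ratings = [5, 4, 3, 1]
labels = [rating_to_sentiment(r) for r in ratings]
binary_labels = [rating_to_sentiment(r, binary=True) for r in ratings]

print(labels)
print(binary_labels)
# Checking the class distribution early helps spot imbalance before training.
print(Counter(labels))
```

Review datasets are often skewed towards high ratings, so it is worth looking at the label counts before choosing between the multiclass and binary setups.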

### Scoring

- Web scraping: 2 points
- Data Cleaning: 2 points (negative points if not well-executed)
    - Highlighting frequent words (and n-grams)
    - Spelling correction: 2 points
- Summary, Translation, and Generation: 2 points
    - Produce a clean file with multiple cleaned columns and corrected/translated texts
- Sentiment Analysis (multiclass or binary classification): 2 points (possible negative points)
- Topic Modeling and Lists of Topics: 2 points
- Embedding to Identify Similar Words and Enrich Theme List: 2 points (possible negative points)
    - Word2Vec Training: 2 points, GloVe: 1 point
    - Visualization of embeddings with Matplotlib and Tensorboard: 2 points
    - Implementation of Euclidean or cosine distance: 1 point
    - Question answering with semantic search: 2 bonus points
- Supervised Learning, each model well-made and well-presented: 2 points (possible negative points)
    - TF-IDF and classical ML
    - Basic model with an embedding layer (embedding visualization with Tensorboard: additional 1 point)
    - Model with pre-trained embeddings (embedding visualization with Tensorboard: additional 1 point)
    - USE (Universal Sentence Encoder) or equivalents, RNN/LSTM, CNN, BERT, or other models on Hugging Face
    - ChatGPT
- Results Interpretation (possible negative points)
    - Error analysis: 1 point
    - Sentiment detection: 2 points
    - Classical models with themes: 2 points
    - Deep learning models for words: 2 points
- Creation of Streamlit applications
    - Prediction (2 points)
    - Summary (2 points)
    - Explanation (3 points)
    - Information Retrieval (3 points)
    - RAG (3 points)
    - QA (3 points)
- Clarity of Presentation: 2 points (possible negative points)
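For the "Euclidean or cosine distance" and "semantic search" items above, a minimal sketch of both measures and of a nearest-neighbour lookup is given below. It uses plain Python lists; the function names and the three-dimensional toy vectors are made up for illustration, and real embeddings would come from a trained Word2Vec or GloVe model.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def most_similar(query, embeddings, k=2):
    """Return the k (word, similarity) pairs closest to `query` by cosine similarity."""
    scored = [(word, cosine_similarity(query, vec)) for word, vec in embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy vectors, invented for this example
toy_embeddings = {
    "pizza": [0.9, 0.1, 0.0],
    "pasta": [0.8, 0.2, 0.1],
    "waiter": [0.1, 0.9, 0.2],
}

print(euclidean_distance([0, 0], [3, 4]))
print(most_similar([1.0, 0.0, 0.0], toy_embeddings))
```

Cosine similarity ignores vector length and only compares direction, which is why it is often preferred over Euclidean distance for comparing word or sentence embeddings.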

You can use this template:
https://docs.google.com/presentation/d/1hyaVKY31U0wP4kensljOgIiudkRC5N1OxZMWqZ07Y5Q/edit?usp=sharing

Template in French:
https://docs.google.com/presentation/d/1LGq58zA_5Usmqkz043iHYe3VqDrbQOARXUl_QWD_W3Y/edit?usp=sharing





## SCRAPING YELP

We use the Yelp API to retrieve restaurant data.

In [17]:
import os

import requests
import pandas as pd

# Read the Yelp API key from the environment rather than hard-coding it.
# The variable name YELP_API_KEY is an assumption; use whatever name you set.
api_key = os.environ["YELP_API_KEY"]
headers = {
    "Authorization": f"Bearer {api_key}"
}

# Base parameters for the restaurant search
url = "https://api.yelp.com/v3/businesses/search"
params = {
    "term": "restaurants",
    "location": "Paris, France",
    "limit": 50  # Maximum number of results the Yelp API allows per request
}

restaurants = []

# Loop to paginate through the Yelp API results
for offset in range(0, 1000, 50):  # 1000 results in total; this bound can be adjusted
    params['offset'] = offset
    response = requests.get(url, headers=headers, params=params)

    # Check whether the request succeeded
    if response.status_code == 200:
        # Add this page of results to the list of restaurants
        restaurants.extend(response.json()['businesses'])
    else:
        # If the request fails, stop the loop (for example, because of the API rate limit)
        print(f"Request failed with status code: {response.status_code}")
        break

# Convert to a DataFrame
restaurants_df = pd.DataFrame(restaurants)
print(f"Total number of restaurants retrieved: {len(restaurants_df)}")


Total number of restaurants retrieved: 1000


In [27]:
restaurants_df.head()

Unnamed: 0,id,alias,name,image_url,is_closed,url,review_count,categories,rating,coordinates,transactions,price,location,phone,display_phone,distance
0,-0iLH7iQNYtoURciDpJf6w,le-comptoir-de-la-gastronomie-paris,Le Comptoir de la Gastronomie,https://s3-media3.fl.yelpcdn.com/bphoto/xT4YkC...,False,https://www.yelp.com/biz/le-comptoir-de-la-gas...,1300,"[{'alias': 'french', 'title': 'French'}]",4.5,"{'latitude': 48.8645157999652, 'longitude': 2....",[],€€,"{'address1': '34 rue Montmartre', 'address2': ...",33142333132,+33 1 42 33 31 32,370.827517
1,IU9_wVOGBKjfqTTpAXpKcQ,bistro-des-augustins-paris,Bistro des Augustins,https://s3-media2.fl.yelpcdn.com/bphoto/ctHDHM...,False,https://www.yelp.com/biz/bistro-des-augustins-...,484,"[{'alias': 'bistros', 'title': 'Bistros'}, {'a...",4.5,"{'latitude': 48.854754, 'longitude': 2.342119}",[],€€,"{'address1': '39 quai des Grands Augustins', '...",33143540441,+33 1 43 54 04 41,801.11761
2,cEjF41ZQB8-SST8cd3EsEw,l-avant-comptoir-paris-3,L'Avant Comptoir,https://s3-media2.fl.yelpcdn.com/bphoto/V38oU4...,False,https://www.yelp.com/biz/l-avant-comptoir-pari...,657,"[{'alias': 'tapas', 'title': 'Tapas Bars'}, {'...",4.5,"{'latitude': 48.85202, 'longitude': 2.3388}",[],€€,"{'address1': '3 carrefour de l'Odéon', 'addres...",33142384755,+33 1 42 38 47 55,1131.333887
3,BuJnfWI86tTxFUon071EKg,brasserie-bellanger-paris,Brasserie Bellanger,https://s3-media2.fl.yelpcdn.com/bphoto/0IQnl-...,False,https://www.yelp.com/biz/brasserie-bellanger-p...,23,"[{'alias': 'french', 'title': 'French'}]",5.0,"{'latitude': 48.880937, 'longitude': 2.3499401}",[],,{'address1': '140 rue du Faubourg Poissonnière...,33954009965,+33 9 54 00 99 65,2185.842371
4,pztzge22A_c_BfzLHCmaMw,le-bistro-du-périgord-paris-3,Le Bistro du Périgord,https://s3-media2.fl.yelpcdn.com/bphoto/VmDGRn...,False,https://www.yelp.com/biz/le-bistro-du-p%C3%A9r...,377,"[{'alias': 'bistros', 'title': 'Bistros'}]",4.5,"{'latitude': 48.8498092201519, 'longitude': 2....",[],€€€,"{'address1': '71 rue Saint-Jacques', 'address2...",33143296749,+33 1 43 29 67 49,1368.72054
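The `categories`, `coordinates`, and `location` columns in the preview above hold nested lists and dictionaries. Once the frame goes through a CSV round trip, these cells come back as plain strings, so they need to be parsed before use. The sketch below shows one way to do that for `categories`; `category_titles` is a hypothetical helper, not something the Yelp API or pandas provides.

```python
import ast

def category_titles(value):
    """Return the list of category titles from a Yelp `categories` cell.

    Accepts either the original list of dicts or its string representation
    (which is what reading the data back from CSV produces).
    """
    if isinstance(value, str):
        # literal_eval only parses Python literals, so it is safer than eval
        value = ast.literal_eval(value)
    return [category["title"] for category in value]

# One cell as it appears in the preview above
raw_cell = "[{'alias': 'french', 'title': 'French'}]"
print(category_titles(raw_cell))
```

On the full frame this could be applied with `restaurants_df["categories"].apply(category_titles)`, which gives a flat list of labels per restaurant that is easier to count or use as a classification target.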


In [32]:
import time

# Search endpoint (same URL as in the first cell)
url_search = "https://api.yelp.com/v3/businesses/search"
search_params = {"term": "restaurants", "location": "Paris, France"}
search_response = requests.get(url_search, headers=headers, params=search_params)

if search_response.status_code == 200:
    search_data = search_response.json()

    # Extract the restaurant IDs from this single search page (not used below)
    business_ids = [business['id'] for business in search_data["businesses"]]

# IDs of all restaurants collected earlier
restaurant_ids = restaurants_df['id']

reviews = []

# Fetch the reviews for each restaurant
for business_id in restaurant_ids:
    url = f"https://api.yelp.com/v3/businesses/{business_id}/reviews"

    try:
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        review_data = response.json()

        # Add the reviews to the list
        for review in review_data.get("reviews", []):
            reviews.append({
                "business_id": business_id,
                "review_id": review["id"],
                "text": review["text"],
                "rating": review["rating"]
            })

        # Respect the API rate limits by pausing between requests
        time.sleep(0.5)  # wait 0.5 seconds
    except requests.exceptions.HTTPError as err:
        print(f"HTTP error for ID {business_id}: {err}")
    except requests.exceptions.RequestException as err:
        print(f"Request error for ID {business_id}: {err}")
        break  # leave the loop on a critical error

# Convert to a DataFrame
reviews_df = pd.DataFrame(reviews)
print(f"Total number of reviews retrieved: {len(reviews_df)}")

HTTP error for ID wKou-aCDkVzx-KHbALvaXg: 429 Client Error: Too Many Requests for url: https://api.yelp.com/v3/businesses/wKou-aCDkVzx-KHbALvaXg/reviews
HTTP error for ID ZHyD9zngru1naJa3d8XIaQ: 429 Client Error: Too Many Requests for url: https://api.yelp.com/v3/businesses/ZHyD9zngru1naJa3d8XIaQ/reviews
HTTP error for ID kIge67cwT4JXWK7zwXTSmg: 429 Client Error: Too Many Requests for url: https://api.yelp.com/v3/businesses/kIge67cwT4JXWK7zwXTSmg/reviews
HTTP error for ID VNXOBJT5bxX7yr-4tGdElA: 429 Client Error: Too Many Requests for url: https://api.yelp.com/v3/businesses/VNXOBJT5bxX7yr-4tGdElA/reviews
HTTP error for ID KpoXpvctx9Yw5rP7T6FdgQ: 429 Client Error: Too Many Requests for url: https://api.yelp.com/v3/businesses/KpoXpvctx9Yw5rP7T6FdgQ/reviews
HTTP error for ID j0lYErdk8ksysfnY0cKyrw: 429 Client Error: Too Many Requests for url: https://api.yelp.com/v3/businesses/j0lYErdk8ksysfnY0cKyrw/reviews
HTTP error for ID uXZerrwr1lMur96Zx2wGCQ: 429 Client E

KeyboardInterrupt: 
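The run above was cut short by repeated 429 (Too Many Requests) responses. One common mitigation is to wait and retry with exponential backoff instead of moving straight on to the next ID. The sketch below is an assumption about how that could look, not the method used in this notebook: `fetch_with_backoff`, its parameters, and the delay schedule are hypothetical, and the request function is passed in as an argument so the logic can be tried without network access. A 429 may also mean that a daily quota is exhausted, in which case retrying will not help.

```python
import time

def fetch_with_backoff(fetch, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call `fetch()` again while it returns HTTP 429, waiting longer each time.

    `fetch` is any zero-argument callable returning an object with a
    `status_code` attribute (for example a `requests.Response`). The delay
    doubles after each 429: base_delay, 2 * base_delay, 4 * base_delay, ...
    Returns the last response, which may still be a 429 if retries ran out.
    """
    response = fetch()
    for attempt in range(max_retries):
        if response.status_code != 429:
            break
        sleep(base_delay * 2 ** attempt)
        response = fetch()
    return response

class FakeResponse:
    """Stand-in for a real HTTP response, used only for this demonstration."""
    def __init__(self, status_code):
        self.status_code = status_code

# Simulate two rate-limited answers followed by a success, recording the
# delays instead of actually sleeping.
statuses = iter([429, 429, 200])
delays = []
result = fetch_with_backoff(lambda: FakeResponse(next(statuses)), sleep=delays.append)
print(result.status_code, delays)
```

In the review loop, the call would become something like `fetch_with_backoff(lambda: requests.get(url, headers=headers))`, followed by the existing `raise_for_status()` check.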

In [37]:
len(reviews)

1234

In [38]:
reviews_df = pd.DataFrame(reviews)

In [39]:
reviews_df.head()

Unnamed: 0,business_id,review_id,text,rating
0,-0iLH7iQNYtoURciDpJf6w,tsubL1mtNvOD1MBSj2ls0Q,"Perfect diary, share the food of this family t...",5
1,-0iLH7iQNYtoURciDpJf6w,sxEFkJ89kyF-wMDUI2ZnWw,"Based on the menu presented, one could write a...",5
2,-0iLH7iQNYtoURciDpJf6w,3MYKaD-tDrUVhRgDh9G4dA,"If you love French OnIon Soup, this is for you...",5
3,IU9_wVOGBKjfqTTpAXpKcQ,PJuWhEzKFz3ipwhOcWMMBA,"Came here with my daughter, son in-law & his m...",5
4,IU9_wVOGBKjfqTTpAXpKcQ,sMcLY9Gpg9ToKqce2MiccQ,"Just a few steps from our hotel, we had wanted...",4


In [None]:
restaurants_df.shape

(1000, 16)

In [40]:
restaurants_df.to_csv("restaurants.csv", sep=';')

In [41]:
reviews_df.shape

(1234, 4)

In [42]:
reviews_df.to_csv("reviews.csv", sep=';')