# ChatGPT Version of the Response

If you’re using the **TMDb API** to build a **movie recommender system** with **Apache Airflow**, here’s a structured **end-to-end approach**:

---

## **Step 1: Define the Architecture**
Your workflow involves:
1. **Data Ingestion** – Extract movie data from TMDb API.
2. **Data Storage** – Store movie metadata in a distributed database (e.g., Snowflake, BigQuery, or PostgreSQL).
3. **Feature Engineering** – Transform and preprocess the data.
4. **Model Training** – Build a recommender system using collaborative filtering, content-based filtering, or hybrid models.
5. **Model Deployment & API** – Deploy a model that serves recommendations via an API.

### **Tech Stack**
- **Apache Airflow** (Orchestrator)
- **TMDb API** (Data Source)
- **Snowflake/BigQuery/PostgreSQL** (Storage)
- **Spark/Pandas** (Processing)
- **Scikit-learn/Surprise/LightFM** (Modeling)
- **Flask/FastAPI** (API for serving recommendations)

---

## **Step 2: Set Up Apache Airflow DAG for Data Ingestion**
The first DAG will:
- Query TMDb API daily for new/updated movies.
- Store the data in a database.

### **Tasks for Airflow DAG**
1. **Extract Data from TMDb API**
   - Use `requests` to fetch movie metadata.
   - Query popular, trending, and newly released movies.
2. **Transform Data**
   - Normalize JSON response.
   - Extract relevant fields (movie ID, title, genre, cast, etc.).
3. **Load Data into a Database**
   - Store the movie data in **Snowflake, PostgreSQL, or BigQuery**.
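
As a rough sketch of the transform task, the function below flattens one TMDb list response into rows that hold only the fields used downstream. The function name `normalize_movies`, the output keys, and the sample payload are illustrative assumptions, not part of the TMDb API or of the DAG shown later.

```python
def normalize_movies(payload):
    """Flatten a TMDb list response into row dicts with a fixed set of fields.

    `payload` is assumed to look like the JSON returned by /movie/popular:
    a dict with a "results" list of movie dicts.
    """
    rows = []
    for movie in payload.get("results", []):
        if "id" not in movie:
            continue  # Skip malformed entries without an id
        rows.append({
            "id": movie["id"],
            "title": movie.get("title", ""),
            "release_date": movie.get("release_date") or None,  # Empty string -> None
            "vote_average": float(movie.get("vote_average", 0.0)),
            "genre_ids": list(movie.get("genre_ids", [])),
            "overview": movie.get("overview", ""),
        })
    return rows


# Hypothetical sample payload, for illustration only
sample = {
    "page": 1,
    "results": [
        {"id": 1, "title": "Example Movie", "release_date": "",
         "vote_average": 7, "genre_ids": [28, 12]},
        {"title": "No id, skipped"},
    ],
}
rows = normalize_movies(sample)
print(rows)
```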

### **Airflow DAG Code (ETL Pipeline)**
```python
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta
import requests
import pandas as pd
import sqlalchemy

# TMDb API Config
API_KEY = "your_tmdb_api_key"
BASE_URL = "https://api.themoviedb.org/3"

# Database Connection (PostgreSQL Example)
DATABASE_URI = "postgresql+psycopg2://user:password@host:5432/moviedb"
engine = sqlalchemy.create_engine(DATABASE_URI)

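def fetch_movies():
    """Fetch one page of popular movies from TMDb and append it to the movies table."""
    response = requests.get(
        f"{BASE_URL}/movie/popular",
        params={"api_key": API_KEY, "language": "en-US", "page": 1},
        timeout=30,
    )
    response.raise_for_status()  # Fail the task on HTTP errors so Airflow can retry
    data = response.json()["results"]
    
    # Convert to DataFrame, keeping only the fields used downstream
    df = pd.DataFrame(data, columns=["id", "title", "release_date", "vote_average", "genre_ids", "overview"])
    # genre_ids is a list per movie; store it as text (e.g. "[28, 12]") so it fits a plain column
    df["genre_ids"] = df["genre_ids"].apply(str)
    # Note: appending daily re-inserts movies that are still popular; deduplicate on id downstream
    df.to_sql("movies", con=engine, if_exists="append", index=False)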

default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": datetime(2024, 2, 1),
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    "tmdb_movie_etl",
    default_args=default_args,
    schedule_interval="@daily",
    catchup=False,
)

fetch_movies_task = PythonOperator(
    task_id="fetch_movies",
    python_callable=fetch_movies,
    dag=dag,
)

fetch_movies_task
```
---
## **Step 3: Feature Engineering**
- Process the stored movie data.
- Convert genres and keywords into **one-hot encoded vectors**.
- Extract text-based features from movie descriptions (`TF-IDF`).
- Normalize ratings for similarity computation.
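
The rating-normalization bullet could look roughly like the sketch below, which rescales a numeric column such as `vote_average` to the range [0, 1] so it does not dominate the one-hot genre columns. Min-max scaling is one assumed choice here (z-scoring is another), and the helper name `min_max_scale` is hypothetical.

```python
def min_max_scale(values):
    """Scale a list of numbers to the range [0, 1]."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical vote averages for three movies
vote_averages = [6.0, 8.0, 7.0]
scaled = min_max_scale(vote_averages)
print(scaled)  # [0.0, 1.0, 0.5]
```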

Example: **Processing Genres for Content-Based Filtering**
```python
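import ast

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Keep overview so the TF-IDF step can read it from movies_processed
df = pd.read_sql("SELECT id, title, overview, genre_ids FROM movies", con=engine)

# Parse genre_ids stored as text (e.g. "[28, 12]") back into a list, without using eval
df["genre_ids"] = df["genre_ids"].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)

# One-hot encode genres, using string column names such as "genre_28"
mlb = MultiLabelBinarizer()
genres_encoded = pd.DataFrame(
    mlb.fit_transform(df["genre_ids"]),
    columns=[f"genre_{g}" for g in mlb.classes_],
    index=df.index,
)
df = df.drop(columns=["genre_ids"]).join(genres_encoded)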

df.to_sql("movies_processed", con=engine, if_exists="replace", index=False)
```
---
## **Step 4: Build the Recommender System**
You can use **two main approaches**:
### **1. Content-Based Filtering (Using Cosine Similarity)**
- Compute similarity based on **genres, keywords, descriptions**.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Load preprocessed data in a fixed order so matrix rows can be mapped back to movies
df = pd.read_sql("SELECT * FROM movies_processed ORDER BY id", con=engine)

# Convert overview text into TF-IDF features
tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(df["overview"].fillna(""))

# Compute similarity (dense n x n matrix; fine for a few thousand movies)
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

# Store similarity matrix
np.save("movie_similarity.npy", cosine_sim)
```

### **2. Collaborative Filtering (Using Surprise)**
- Uses user ratings to recommend movies.

```python
import pickle

import pandas as pd
from surprise import SVD, Dataset, Reader, accuracy
from surprise.model_selection import train_test_split

# Load user ratings (assumes a ratings table exists; the ETL above does not create one)
df_ratings = pd.read_sql("SELECT user_id, movie_id, rating FROM ratings", con=engine)

# Prepare dataset
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df_ratings[["user_id", "movie_id", "rating"]], reader)

# Train model
trainset, testset = train_test_split(data, test_size=0.2)
model = SVD()
model.fit(trainset)

# Evaluate on the held-out split (prints RMSE)
predictions = model.test(testset)
accuracy.rmse(predictions)

# Save model
with open("collab_filtering_model.pkl", "wb") as f:
    pickle.dump(model, f)
```
---
## **Step 5: Deploy the Recommender System**
Once trained, deploy a **Flask or FastAPI** service that serves recommendations.

### **Flask API for Movie Recommendations**
```python
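from flask import Flask, request, jsonify
import numpy as np
import pandas as pd
import sqlalchemy

app = Flask(__name__)

# Load the similarity matrix and the movie table it was built from.
# Assumes the matrix rows follow this query's row order (ORDER BY id).
engine = sqlalchemy.create_engine("postgresql+psycopg2://user:password@host:5432/moviedb")
df = pd.read_sql("SELECT id, title FROM movies_processed ORDER BY id", con=engine)
cosine_sim = np.load("movie_similarity.npy")

@app.route("/recommend", methods=["GET"])
def recommend():
    movie_id = request.args.get("movie_id", type=int)
    if movie_id is None:
        return jsonify({"error": "movie_id must be an integer"}), 400
    matches = df.index[df["id"] == movie_id]
    if len(matches) == 0:
        return jsonify({"error": "movie not found"}), 404
    idx = matches[0]
    
    # Rank by similarity, drop the movie itself, keep the top 5
    ranked = cosine_sim[idx].argsort()[::-1]
    similar_indices = [i for i in ranked if i != idx][:5]
    recommendations = df.iloc[similar_indices][["id", "title"]].to_dict(orient="records")

    return jsonify(recommendations)

if __name__ == "__main__":
    app.run(debug=True)  # Debug mode is for local development only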
```
---
## **Step 6: Automate Everything with Airflow**
- **Ingestion DAG** – Fetch new data daily.
- **Processing DAG** – Feature engineering weekly.
- **Model Retraining DAG** – Train recommender model monthly.
- **API Deployment DAG** – Deploy new model when retrained.

---
## **Final Thoughts**
🎯 **By following this pipeline, you'll have:**
- **Daily-updated** movie data from TMDb.
- **A structured and scalable recommender system.**
- **An Airflow-based automated pipeline.**
- **A deployed API serving recommendations.**

Would you like help setting up Airflow DAGs for **model training and deployment** as well? 🚀

# Claude version of the response

I'll help you design a complete workflow for building a movie recommender system using TMDB API, Airflow, and MongoDB. Let me create an example implementation.



I've created a complete Airflow DAG that implements a movie recommender system. Here's how the workflow functions:

1. **Data Collection (Task 1)**:
   - Fetches popular movies from TMDB API
   - Stores raw data in MongoDB's `raw_movies` collection
   - This creates our initial dataset

2. **Data Enrichment (Task 2)**:
   - Retrieves additional details for each movie (genres, runtime, budget)
   - Fetches cast and crew information
   - Stores enriched data in the `enriched_movies` collection

3. **Feature Engineering (Task 3)**:
   - Extracts relevant features from the enriched data
   - Creates feature vectors including genres, cast, crew, and metrics
   - Stores these in the `feature_vectors` collection

4. **Recommendation Generation (Task 4)**:
   - Builds a similarity matrix using cosine similarity
   - Generates movie recommendations based on content features
   - Stores recommendations in the `recommendations` collection
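
To make the similarity step concrete, here is a dependency-free sketch of cosine similarity and top-k selection over small feature vectors. The helper names `cosine` and `top_similar` and the three toy genre vectors are assumptions for illustration; the DAG itself uses scikit-learn's `cosine_similarity` on a full feature matrix.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # A zero vector has no direction; treat it as dissimilar
    return dot / (norm_u * norm_v)


def top_similar(vectors, target, k=5):
    """Return up to k (key, score) pairs most similar to `target`, excluding itself."""
    scores = [
        (key, cosine(vectors[target], vec))
        for key, vec in vectors.items()
        if key != target
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Hypothetical one-hot genre vectors: [Action, Comedy, Drama]
vectors = {
    "A": [1, 0, 1],
    "B": [1, 0, 0],
    "C": [0, 1, 0],
}
result = top_similar(vectors, "A", k=2)
print(result)  # "B" ranks first because it shares the Action genre with "A"
```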

MongoDB is integrated throughout the workflow in several meaningful ways:
- Acts as a data lake for raw TMDB data
- Stores enriched movie information
- Maintains feature vectors for quick access
- Persists final recommendations for serving
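
For the serving side, a lookup against the `recommendations` collection might look like the sketch below. The function name `get_recommendations` is hypothetical, and `FakeCollection` is an in-memory stand-in used only so the example runs without a database; with pymongo you would pass `db['recommendations']` instead, since only `find_one(filter, projection)` is called.

```python
def get_recommendations(collection, movie_id, limit=5):
    """Return up to `limit` stored recommendations for a movie, or [] if none exist.

    `collection` is expected to behave like a pymongo Collection;
    only `find_one(filter, projection)` is used.
    """
    doc = collection.find_one({"movie_id": movie_id}, {"_id": 0})
    if doc is None:
        return []
    return doc.get("recommended_movies", [])[:limit]


class FakeCollection:
    """In-memory stand-in for a MongoDB collection, used only for this demo."""

    def __init__(self, docs):
        self._docs = docs

    def find_one(self, query, projection=None):
        # Return a copy of the first document matching every key in the query
        for doc in self._docs:
            if all(doc.get(k) == v for k, v in query.items()):
                return dict(doc)
        return None


# Hypothetical document shaped like the ones written by the recommendation task
fake = FakeCollection([
    {"movie_id": 10, "title": "Example", "recommended_movies": [
        {"movie_id": 11, "title": "Other", "similarity_score": 0.9},
        {"movie_id": 12, "title": "Another", "similarity_score": 0.8},
    ]},
])
print(get_recommendations(fake, 10, limit=1))
print(get_recommendations(fake, 99))
```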

To use this system:

1. Set up your environment:
```bash
pip install apache-airflow pymongo pandas scikit-learn requests
```

2. Configure MongoDB and TMDB API:
- Replace `'mongodb://localhost:27017/'` with your MongoDB connection string
- Add your TMDB API key

3. Deploy the DAG to your Airflow instance

Would you like me to explain any specific part in more detail or add additional features to the system?

In [None]:
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator
from pymongo import MongoClient
import requests
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# DAG default arguments
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2025, 2, 1),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# MongoDB connection setup
def get_mongo_client():
    return MongoClient('mongodb://localhost:27017/')

# TMDB API configuration
TMDB_API_KEY = 'your_api_key_here'
TMDB_BASE_URL = 'https://api.themoviedb.org/3'

# Task 1: Fetch movie data from TMDB
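def fetch_tmdb_data():
    movies = []
    for page in range(1, 11):  # Fetch first 10 pages
        response = requests.get(
            f'{TMDB_BASE_URL}/movie/popular',
            params={
                'api_key': TMDB_API_KEY,
                'page': page
            },
            timeout=30
        )
        response.raise_for_status()  # Fail the task on HTTP errors so Airflow retries
        movies.extend(response.json()['results'])
    
    # Store in MongoDB, upserting on the TMDB id so daily runs do not create duplicates
    client = get_mongo_client()
    db = client['movie_recommender']
    collection = db['raw_movies']
    for movie in movies:
        collection.replace_one({'id': movie['id']}, movie, upsert=True)
    client.close()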

# Task 2: Process and enrich movie data
def process_movie_data():
    client = get_mongo_client()
    db = client['movie_recommender']
    # Exclude Mongo's _id so documents can be upserted into another collection
    raw_movies = list(db['raw_movies'].find({}, {'_id': 0}))
    
    # Enrich with additional movie details and credits
    for movie in raw_movies:
        movie_id = movie['id']
        # Fetch additional details
        details_response = requests.get(
            f'{TMDB_BASE_URL}/movie/{movie_id}',
            params={'api_key': TMDB_API_KEY},
            timeout=30
        )
        details_response.raise_for_status()
        details = details_response.json()
        
        # Fetch credits
        credits_response = requests.get(
            f'{TMDB_BASE_URL}/movie/{movie_id}/credits',
            params={'api_key': TMDB_API_KEY},
            timeout=30
        )
        credits_response.raise_for_status()
        credits = credits_response.json()
        crew = credits.get('crew', [])
        
        # Combine data
        enriched_movie = {
            **movie,
            'genres': details.get('genres', []),
            'runtime': details.get('runtime', 0),
            'budget': details.get('budget', 0),
            'revenue': details.get('revenue', 0),
            'cast': credits.get('cast', [])[:5],  # Top 5 cast members
            # Up to 5 crew members, directors first so truncation does not drop them
            'crew': sorted(crew, key=lambda m: m.get('job') != 'Director')[:5]
        }
        # Upsert on the TMDB id so reruns update existing documents instead of failing
        db['enriched_movies'].replace_one({'id': movie_id}, enriched_movie, upsert=True)
    client.close()

# Task 3: Build feature vectors
def build_feature_vectors():
    client = get_mongo_client()
    db = client['movie_recommender']
    movies = list(db['enriched_movies'].find())
    
    # Create feature vectors
    feature_vectors = []
    for movie in movies:
        # Extract genres
        genres = [genre['name'] for genre in movie['genres']]
        
        # Extract cast and crew
        cast = [member['name'] for member in movie['cast']]
        directors = [member['name'] for member in movie['crew'] 
                    if member['job'] == 'Director']
        
        # Create feature dictionary
        features = {
            'movie_id': movie['id'],
            'title': movie['title'],
            'genres': genres,
            'cast': cast,
            'directors': directors,
            'popularity': movie['popularity'],
            'vote_average': movie['vote_average'],
            'vote_count': movie['vote_count']
        }
        feature_vectors.append(features)
    
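    # Store feature vectors, upserting on movie_id so reruns do not create duplicates
    for features in feature_vectors:
        db['feature_vectors'].replace_one({'movie_id': features['movie_id']}, features, upsert=True)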
    client.close()

# Task 4: Generate recommendations
def generate_recommendations():
    client = get_mongo_client()
    db = client['movie_recommender']
    features = list(db['feature_vectors'].find())
    
    # Create similarity matrix
    df = pd.DataFrame(features)
    
    # Create genre matrix (one-hot encoding); dropna skips the NaN produced by empty genre lists
    genres = df['genres'].explode().dropna().unique()
    genre_matrix = pd.DataFrame(
        [[1 if genre in movie['genres'] else 0 for genre in genres] 
         for movie in features],
        columns=genres
    )
    
    # Normalize numerical features (z-score); a constant column divides by zero, so fill NaN with 0
    numerical_features = df[['popularity', 'vote_average', 'vote_count']]
    normalized_features = ((numerical_features - numerical_features.mean()) / numerical_features.std()).fillna(0)
    
    # Combine features
    final_features = pd.concat([genre_matrix, normalized_features], axis=1)
    
    # Calculate similarity matrix
    similarity_matrix = cosine_similarity(final_features)
    
    # Generate recommendations for each movie
    recommendations = []
    for i, movie in enumerate(features):
        # Get top 5 similar movies
        similar_indices = similarity_matrix[i].argsort()[-6:-1][::-1]
        recommendation = {
            'movie_id': movie['movie_id'],
            'title': movie['title'],
            'recommended_movies': [
                {
                    'movie_id': features[idx]['movie_id'],
                    'title': features[idx]['title'],
                    'similarity_score': float(similarity_matrix[i][idx])
                }
                for idx in similar_indices
            ]
        }
        recommendations.append(recommendation)
    
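    # Store recommendations, upserting on movie_id so each movie keeps one current entry
    for recommendation in recommendations:
        db['recommendations'].replace_one({'movie_id': recommendation['movie_id']}, recommendation, upsert=True)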
    client.close()

# Create DAG
dag = DAG(
    'movie_recommender',
    default_args=default_args,
    description='Movie recommender system pipeline',
    schedule_interval=timedelta(days=1),
    catchup=False
)

# Define tasks
t1 = PythonOperator(
    task_id='fetch_tmdb_data',
    python_callable=fetch_tmdb_data,
    dag=dag,
)

t2 = PythonOperator(
    task_id='process_movie_data',
    python_callable=process_movie_data,
    dag=dag,
)

t3 = PythonOperator(
    task_id='build_feature_vectors',
    python_callable=build_feature_vectors,
    dag=dag,
)

t4 = PythonOperator(
    task_id='generate_recommendations',
    python_callable=generate_recommendations,
    dag=dag,
)

# Set task dependencies
t1 >> t2 >> t3 >> t4