# Navigational Search Pipeline: Masterlist Construction

This notebook processes school-level search logs, identifies navigational queries, and constructs a masterlist of searches for all schools. Each (`device_name_actual`, `school_name`) pair is treated as a distinct user to preserve school-specific analysis.

---
## 1. Settings and Imports

We import the required libraries and define the file paths, keyword list, and file suffix used throughout the notebook:

- `BASE_PATH`: Folder containing one subfolder per school, each holding that school's cleaned CSV.

- `NAV_DICT_FILE`: CSV of known navigational sites (read from its `site_name` column).

- `KEYWORDS`: Words appended to site names to catch variations (`online`, `web`, `login`, `channel`).

- `FILE_SUFFIX`: The suffix of cleaned school CSVs (`_all_searches_tagged.csv`).

- `OUTPUT_MASTER`: Path to save the final master file.

In [None]:
import os
import pandas as pd
import re
from fuzzywuzzy import fuzz

# =====================================================
# SETTINGS
# =====================================================
BASE_PATH = "/Users/tdf/Downloads/q_episode_processing/cleaned"  # Folder with school subfolders
NAV_DICT_FILE = "/Users/tdf/Downloads/navigational_dictionary.csv"
OUTPUT_MASTER = "/Users/tdf/Downloads/q_episode_processing/master_all_schools.csv"
KEYWORDS = ["online", "web", "login", "channel"]
FILE_SUFFIX = "_all_searches_tagged.csv"  # Each school's cleaned CSV

---
## 2. Load Navigational Terms

We build the list of navigational search terms from the provided dictionary. Each site name is kept on its own and is also combined with each keyword (`"online"`, `"web"`, `"login"`, `"channel"`), so one site name produces five terms. Terms are lowercased for case-insensitive matching and then deduplicated.

In [None]:
nav_dict = pd.read_csv(NAV_DICT_FILE)
nav_terms = []

for site in nav_dict['site_name'].dropna().unique():
    nav_terms.append(site.lower())
    for kw in KEYWORDS:
        nav_terms.append(f"{site.lower()} {kw}")

nav_terms = list(set(nav_terms))
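
To make the expansion concrete, here is a minimal self-contained sketch of the same idea on a few made-up site names. The `site_names` list and the `expand_terms` helper are hypothetical stand-ins for the `site_name` column and the loop above, and the extra `strip()` is an assumption about handling stray whitespace, not something the notebook does.

```python
KEYWORDS = ["online", "web", "login", "channel"]

# Hypothetical site names; the notebook reads these from NAV_DICT_FILE
site_names = ["YouTube", "Khan Academy", "youtube"]

def expand_terms(sites, keywords):
    """Return the sorted, deduplicated set of lowercase terms for the given sites."""
    terms = set()
    for site in sites:
        base = site.strip().lower()  # assumption: trim whitespace before lowercasing
        terms.add(base)
        terms.update(f"{base} {kw}" for kw in keywords)
    return sorted(terms)

terms = expand_terms(site_names, KEYWORDS)
print(len(terms))
print(terms[:3])
```

Because `"YouTube"` and `"youtube"` collapse to the same lowercase base, the two distinct sites here yield ten terms in total.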

---
## 3. Helper Functions

Three helper functions are defined:

- **`extract_query(uri)`**: Extracts the search query (the `q` parameter) from a URI and normalizes it to lowercase. If no `q` parameter is present, the whole URI is returned in lowercase.

- **`is_navigational(query)`**: Flags a query as navigational if it contains any term in the expanded dictionary, or if its `fuzz.partial_ratio` score against a term is at least 65.

- **`is_utility_query(query)`**: Flags utility URLs (`/url`, `uviewer`, `ogs.google`), which are excluded from navigational classification.

In [None]:
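from urllib.parse import unquote_plus

def extract_query(uri):
    """Extract the search query from a URI, normalize to lowercase."""
    if pd.isna(uri):
        return ""
    uri = str(uri)  # guard against non-string values before regex matching
    match = re.search(r"[?&]q=([^&]+)", uri)
    if match:
        # Decode '+' and percent-escapes (e.g. %20) before lowercasing
        return unquote_plus(match.group(1)).lower()
    return uri.lower()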

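FUZZY_THRESHOLD = 65  # minimum fuzz.partial_ratio score (0-100) to count as a match

def is_navigational(query):
    """Return True if query contains, or fuzzily matches, any navigational term."""
    for term in nav_terms:
        # Cheap substring test first; fall back to fuzzy partial matching
        if term in query or fuzz.partial_ratio(query, term) >= FUZZY_THRESHOLD:
            return True
    return False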

utility_patterns = [r'/url', r'uviewer', r'ogs\.google']

def is_utility_query(query):
    """Return True if query is a utility URL that should be ignored."""
    return any(re.search(pattern, query) for pattern in utility_patterns)
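
The behavior of these helpers is easiest to see on a few sample URIs. The sketch below is a simplified, standard-library-only illustration: the URIs and `sample_terms` are invented, the `_demo` functions are hypothetical stand-ins for the notebook's helpers, and the navigational check uses only the substring rule (the fuzzy rule is left out so the example runs without `fuzzywuzzy`).

```python
import re
from urllib.parse import unquote_plus

# Hypothetical terms; the notebook builds nav_terms from NAV_DICT_FILE
sample_terms = ["youtube", "youtube login"]
utility_patterns = [r"/url", r"uviewer", r"ogs\.google"]

def extract_query_demo(uri):
    """Return the decoded, lowercased q parameter, or the lowercased URI."""
    match = re.search(r"[?&]q=([^&]+)", uri)
    return unquote_plus(match.group(1)).lower() if match else uri.lower()

def is_utility_demo(query):
    return any(re.search(p, query) for p in utility_patterns)

def is_nav_substring_only(query):
    # Substring rule only; the fuzzy rule is deliberately omitted here
    return any(term in query for term in sample_terms)

samples = [
    "https://www.google.com/search?q=YouTube+login&hl=en",
    "https://www.google.com/search?q=photosynthesis%20steps",
    "https://ogs.google.com/widget/app",
]

results = []
for uri in samples:
    q = extract_query_demo(uri)
    flagged = is_nav_substring_only(q) and not is_utility_demo(q)
    results.append((q, is_utility_demo(q), flagged))
    print(results[-1])
```

The third URI has no `q` parameter, so the whole URI becomes the "query" and is caught by the `ogs\.google` utility pattern.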

---
## 4. Process School Files

For each school folder:

- Load the cleaned CSV (`*_all_searches_tagged.csv`).

- Add the `school_name` column so that the same `device_name_actual` appearing in different schools is treated as distinct users.

- Extract the query (`extract_query`) and drop duplicate rows that share the same device, query, and timestamp.

- Classify navigational queries (`is_navigational`), excluding utility URLs.

- Sort by device and timestamp, then initialize `q_episode` to 0 ahead of episode assignment.
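
The folder walk in the first step could also be written with `pathlib`. This is a minimal sketch, assuming the layout described above (one subfolder per school directly under the base folder); `find_school_files` is a hypothetical helper, not part of the notebook.

```python
from pathlib import Path

FILE_SUFFIX = "_all_searches_tagged.csv"

def find_school_files(base_path, suffix=FILE_SUFFIX):
    """Return sorted (school_name, csv_path) pairs found one level below base_path."""
    pairs = []
    # "*/*suffix" matches files inside immediate subfolders only
    for path in sorted(Path(base_path).glob(f"*/*{suffix}")):
        if path.is_file():
            pairs.append((path.parent.name, path))
    return pairs
```

Calling `find_school_files(BASE_PATH)` would then give the school name (the subfolder name) alongside each CSV path, and files sitting directly in the base folder are ignored.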

In [None]:
all_dfs = []

for school_folder in os.listdir(BASE_PATH):
    folder_path = os.path.join(BASE_PATH, school_folder)
    if not os.path.isdir(folder_path):
        continue

    for file in os.listdir(folder_path):
        if not file.endswith(FILE_SUFFIX):
            continue
        full_path = os.path.join(folder_path, file)
        print("Loading:", full_path)

        df = pd.read_csv(full_path)

        # Add school_name column (important for identity)
        df['school_name'] = school_folder

        # Extract query and remove duplicates
        df['search_q'] = df['uri'].apply(extract_query)
        df = df.drop_duplicates(subset=['device_name_actual', 'search_q', 'created_at'])

        # Classify navigational searches, excluding utility URLs
        df['is_navigational'] = df['search_q'].apply(
            lambda q: is_navigational(q) and not is_utility_query(q)
        )

        # Sort for q_episode calculation
        df = df.sort_values(['device_name_actual', 'created_at'])

        # Initialize q_episode
        df['q_episode'] = 0


---
## 5. Episode Assignment

- `q_episode` = 0 for navigational and utility queries.

- **Non-navigational queries**:
    - Queries **within 5 minutes** of the last meaningful query remain in the **same episode**.
    - Queries **more than 5 minutes** apart start a **new episode**.
- Episode counter resets per (`device_name_actual`, `school_name`), so the same device in different schools is treated independently.
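
The rule above can be sketched in plain Python for a single user. This is an illustrative sketch, not the notebook's code: `assign_episodes` and the sample timestamps are hypothetical, and each event carries a flag saying whether it is excluded (for example, navigational).

```python
from datetime import datetime, timedelta

GAP = timedelta(minutes=5)  # same 5-minute threshold as described above

def assign_episodes(events):
    """events: list of (timestamp, is_excluded) tuples sorted by timestamp.

    Returns one episode number per event; excluded events get 0 and do not
    affect the timing of later events.
    """
    episodes = []
    episode = 1
    last_time = None
    for ts, excluded in events:
        if excluded:
            episodes.append(0)
            continue
        if last_time is not None and ts - last_time > GAP:
            episode += 1  # gap strictly greater than 5 minutes starts a new episode
        episodes.append(episode)
        last_time = ts
    return episodes

t0 = datetime(2024, 1, 1, 9, 0)
events = [
    (t0, False),                          # first query
    (t0 + timedelta(minutes=2), True),    # excluded query
    (t0 + timedelta(minutes=4), False),   # 4 minutes after the first query
    (t0 + timedelta(minutes=12), False),  # 8 minutes after the previous one
]
print(assign_episodes(events))  # [1, 0, 1, 2]
```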

In [None]:
        # Assign q_episode per user per school
        for (device_id, school), user_data in df.groupby(['device_name_actual', 'school_name']):
            last_time = None
            episode = 1
            for idx, row in user_data.iterrows():
                # Navigational and utility queries stay at episode 0 and do not affect timing
                if row['is_navigational'] or is_utility_query(row['search_q']):
                    df.at[idx, 'q_episode'] = 0
                    continue
                current_time = pd.to_datetime(row['created_at'])
                # New episode if the gap since the last meaningful query is > 5 min
                if last_time is not None and (current_time - last_time).total_seconds() > 5 * 60:
                    episode += 1
                df.at[idx, 'q_episode'] = episode
                last_time = current_time

        all_dfs.append(df)
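
Row-by-row iteration with `iterrows` can be slow on large logs. A vectorized variant might look like the sketch below. It assumes `created_at` parses with `pd.to_datetime` and that excluded rows are marked by the `is_navigational` column alone; the sample rows are invented for illustration.

```python
import pandas as pd

# Invented sample rows for one device seen in two schools
df = pd.DataFrame({
    "device_name_actual": ["A", "A", "A", "A", "A"],
    "school_name": ["s1", "s1", "s1", "s1", "s2"],
    "created_at": [
        "2024-01-01 09:00:00", "2024-01-01 09:02:00", "2024-01-01 09:04:00",
        "2024-01-01 09:12:00", "2024-01-01 09:13:00",
    ],
    "is_navigational": [False, True, False, False, False],
})

keys = ["device_name_actual", "school_name"]
df["created_at"] = pd.to_datetime(df["created_at"])
df = df.sort_values(keys + ["created_at"])

# Work only on the rows that take part in episode assignment
mask = ~df["is_navigational"]
sub = df[mask]

# Time since the previous included query of the same (device, school) user
gap = sub.groupby(keys)["created_at"].diff()

# A gap over 5 minutes opens a new episode; the cumulative count numbers them from 1
new_episode = (gap > pd.Timedelta(minutes=5)).astype(int)
episode = new_episode.groupby([sub[k] for k in keys]).cumsum() + 1

df["q_episode"] = 0
df.loc[mask, "q_episode"] = episode
print(df[keys + ["created_at", "q_episode"]])
```

The first included query of each user has no previous timestamp, so its gap is `NaT`, the comparison is `False`, and the user starts at episode 1.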


---
## 6. Concatenate All Schools

Combine all individual school dataframes into a master dataframe for downstream analysis.

In [None]:
master_df = pd.concat(all_dfs, ignore_index=True)
print("Master dataframe shape:", master_df.shape)

---
## 7. Export Master File

Save the master dataframe to CSV for further analysis.

In [None]:
master_df.to_csv(OUTPUT_MASTER, index=False)
print("Saved master file to:", OUTPUT_MASTER)