# 0. System Setup (Cleaning Later)

In [8]:
import pandas as pd
import numpy as np
import math

In [2]:
# --- Load data ---
df = pd.read_csv("survey.csv")

# 1. Geographic profile of respondents
**ZIP code distribution**
* Frequency table of ZIP code (Q4).
* Top 10 ZIP codes, plus “Other”.
* Optionally, map ZIPs to “Cleveland vs suburban vs out-of-state”.

**ZIP / region / distance from West Side Market vs.:**
* Frequency of visit (Q8).
* Last visit recency (Q9).
* Typical spend per visit (Q11).
* Travel mode (Q7).
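
The comparisons listed above could be sketched with `pd.crosstab`. The toy frame and its answer labels below are made up for illustration; only the column names `Region` and `Q8` come from this notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for the survey; the labels are assumptions.
toy = pd.DataFrame({
    "Region": ["Cleveland", "Cleveland", "Ohio (Suburban)", "Ohio (Suburban)",
               "Out-of-state", "Cleveland"],
    "Q8": ["Weekly", "Monthly", "Monthly", "Rarely", "Rarely", "Weekly"],
})

# Row-normalized cross-tab: within each region, the share of each visit frequency.
region_by_freq = pd.crosstab(toy["Region"], toy["Q8"], normalize="index")
print(region_by_freq.round(2))
```

Normalizing by row keeps regions with very different respondent counts comparable; dropping `normalize` returns raw counts instead.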

In [6]:
# --- Clean ZIP codes (extract proper 5-digit ZIP or mark missing) ---
df['ZIP_clean'] = (
    df['Q4']
    .astype(str)
    .str.extract(r'(\d{5})')[0]
    .fillna("MISSING")
)

# --- Region mapping (Cleveland vs suburban vs out-of-state) ---
cleveland_zips = {
    "44102","44113","44109","44144","44107","44114","44115","44106","44105",
    "44111","44103","44110","44104","44108","44112","44120"
}

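# Ohio ZIP codes fall in the 43001-45999 range. Any Ohio ZIP not listed in
# cleveland_zips is labeled "suburban" below, however far it is from Cleveland.
ohio_zips = {str(z).zfill(5) for z in range(43000, 46000)}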

def classify_region(zipcode):
    if zipcode in cleveland_zips:
        return "Cleveland"
    elif zipcode in ohio_zips:
        return "Ohio (Suburban)"
    elif zipcode == "MISSING":
        return "Missing"
    else:
        return "Out-of-state"

df['Region'] = df['ZIP_clean'].apply(classify_region)

# --- Cleveland ZIP → Neighborhood mapping ---
zip_to_neighborhood = {
    "44102": "Detroit–Shoreway / Cudell",
    "44113": "Ohio City / Tremont / Clark-Fulton",
    "44109": "Old Brooklyn",
    "44144": "Brooklyn / Linndale",
    "44107": "Lakewood",
    "44114": "Downtown / Asiatown",
    "44115": "Downtown Cleveland",
    "44106": "University Circle / Little Italy",
    "44105": "Slavic Village",
    "44111": "West Park / Kamm’s Corners",
    "44103": "Hough / St. Clair–Superior",
    "44110": "Collinwood",
    "44104": "Central / Kinsman",
    "44108": "Glenville",
    "44112": "East Cleveland",
    "44120": "Shaker Square / Buckeye-Woodhill / Shaker Heights"
}

def map_neighborhood(zipcode):
    if zipcode in zip_to_neighborhood:
        return zip_to_neighborhood[zipcode]
    elif zipcode == "MISSING":
        return "Missing ZIP"
    elif zipcode in ohio_zips:
        return "Ohio Suburb"
    else:
        return "Out-of-state"

df['ZIP_neighborhood'] = df['ZIP_clean'].apply(map_neighborhood)

# --- Frequency of neighborhoods (exclude Missing ZIP) ---
nh_counts = (
    df[df['ZIP_neighborhood'] != "Missing ZIP"]['ZIP_neighborhood']
    .value_counts()
    .reset_index()
)
nh_counts.columns = ['Neighborhood', 'count']

# Top 10 neighborhoods
top10_neighborhoods = nh_counts.head(10)['Neighborhood'].tolist()

# --- Neighborhood bucket column ---
# Anything outside the top 10 becomes "Other", including "Missing ZIP" rows.
df['Neighborhood_bucket'] = df['ZIP_neighborhood'].apply(
    lambda n: n if n in top10_neighborhoods else "Other"
)

# --- Missing ZIP count ---
missing_zip_count = (df['ZIP_neighborhood'] == "Missing ZIP").sum()

# ======================================================
# OUTPUTS
# ======================================================

print("Top 10 ZIP Neighborhoods:")
display(nh_counts.head(10))

print(f"\nMissing ZIP count: {missing_zip_count}")

print("\nNeighborhood Bucket Counts (Top 10 + Other):")
display(df['Neighborhood_bucket'].value_counts().reset_index())

print("\nRegions:")
display(df['Region'].value_counts().reset_index())


Top 10 ZIP Neighborhoods:


| | Neighborhood | count |
|---|---|---|
| 0 | Ohio Suburb | 377 |
| 1 | University Circle / Little Italy | 78 |
| 2 | Lakewood | 40 |
| 3 | Ohio City / Tremont / Clark-Fulton | 32 |
| 4 | Out-of-state | 29 |
| 5 | Detroit–Shoreway / Cudell | 24 |
| 6 | Old Brooklyn | 24 |
| 7 | West Park / Kamm’s Corners | 20 |
| 8 | Brooklyn / Linndale | 10 |
| 9 | Shaker Square / Buckeye-Woodhill / Shaker Heights | 8 |



Missing ZIP count: 30

Neighborhood Bucket Counts (Top 10 + Other):


| | Neighborhood_bucket | count |
|---|---|---|
| 0 | Ohio Suburb | 377 |
| 1 | University Circle / Little Italy | 78 |
| 2 | Other | 44 |
| 3 | Lakewood | 40 |
| 4 | Ohio City / Tremont / Clark-Fulton | 32 |
| 5 | Out-of-state | 29 |
| 6 | Detroit–Shoreway / Cudell | 24 |
| 7 | Old Brooklyn | 24 |
| 8 | West Park / Kamm’s Corners | 20 |
| 9 | Brooklyn / Linndale | 10 |



Regions:


| | Region | count |
|---|---|---|
| 0 | Ohio (Suburban) | 377 |
| 1 | Cleveland | 250 |
| 2 | Missing | 30 |
| 3 | Out-of-state | 29 |


## Distance from WSM (for later)
**Haversine Formula**
The Earth is curved, so Euclidean distance computed directly on latitude/longitude pairs is inaccurate. The Haversine formula instead gives the great-circle distance: the length of the shortest path between two points along the surface of a sphere.
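
For reference, the formula can be written as follows, where $\varphi$ is latitude and $\lambda$ is longitude (both in radians) and $R$ is the radius of the sphere. The symbols are chosen here for illustration and correspond to `lat`, `lon`, and `R` in the code cell that follows.

```latex
a = \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right)
    + \cos\varphi_1 \,\cos\varphi_2 \,\sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right),
\qquad
d = 2R \,\arcsin\!\left(\sqrt{a}\right)
```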

In [10]:
# West Side Market coordinates
WSM_LAT = 41.4840
WSM_LON = -81.7031

def haversine(lat1, lon1, lat2=WSM_LAT, lon2=WSM_LON):
    """
    Calculate distance in miles using Haversine formula.
    """
    R = 3958.8  # Earth radius in miles

    lat1, lon1, lat2, lon2 = map(math.radians, [lat1, lon1, lat2, lon2])

    dlat = lat2 - lat1
    dlon = lon2 - lon1

    a = (math.sin(dlat/2)**2 +
         math.cos(lat1) * math.cos(lat2) * math.sin(dlon/2)**2)

    # min() guards against rounding pushing sqrt(a) marginally above 1
    c = 2 * math.asin(min(1.0, math.sqrt(a)))

    return R * c

In [12]:
df['lat'] = pd.to_numeric(df['LocationLatitude'], errors='coerce')
df['lon'] = pd.to_numeric(df['LocationLongitude'], errors='coerce')

In [13]:
df['distance_miles'] = df.apply(
    lambda r: haversine(r['lat'], r['lon'])
    if pd.notnull(r['lat']) and pd.notnull(r['lon'])
    else None,
    axis=1
)
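
Row-wise `apply` can be slow on larger frames. A vectorized NumPy version is one possible alternative; the sketch below is self-contained, and `haversine_vec` and `EARTH_RADIUS_MILES` are names introduced here for illustration, not part of the notebook.

```python
import numpy as np

WSM_LAT, WSM_LON = 41.4840, -81.7031  # same market coordinates as above
EARTH_RADIUS_MILES = 3958.8

def haversine_vec(lat, lon, lat2=WSM_LAT, lon2=WSM_LON):
    """Great-circle distance in miles for arrays of coordinates; NaN in, NaN out."""
    lat1 = np.radians(np.asarray(lat, dtype=float))
    lon1 = np.radians(np.asarray(lon, dtype=float))
    lat2, lon2 = np.radians(lat2), np.radians(lon2)
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    # Clip guards against rounding pushing `a` marginally outside [0, 1].
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

# Sanity checks: zero distance at the market itself, roughly 69 miles for one
# degree of latitude, and a missing coordinate stays missing.
dist = haversine_vec([WSM_LAT, WSM_LAT + 1.0, np.nan], [WSM_LON, WSM_LON, WSM_LON])
print(dist)
```

On the notebook's frame this would be called as `haversine_vec(df['lat'], df['lon'])`, with missing coordinates propagating as NaN rather than needing an explicit null check.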

# 2. Awareness & Familiarity
**Are you familiar with the West Side Market? (Q3)**
* Proportion familiar vs not familiar.
* Compare familiarity vs:
    * Impressions (Q10).
    * Visit frequency (Q8).
    * Reasons for not visiting (Q19).
    * Age group (Q5).
    * Student vs non-student (Q20).
    * Distance from the market (`df['distance_miles']`) or region (`df['ZIP_neighborhood']`).
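
A minimal sketch of the familiarity share and its breakdown by distance, on made-up data. The "Yes"/"No" labels for `Q3` and the distance band edges are assumptions, not values taken from the survey.

```python
import pandas as pd

# Hypothetical toy data; the Q3 labels and the band edges are assumptions.
toy = pd.DataFrame({
    "Q3": ["Yes", "Yes", "No", "Yes", "No", "Yes"],
    "distance_miles": [0.8, 3.5, 12.0, 4.9, 40.0, None],
})

# Overall proportion familiar vs not familiar.
familiar_share = toy["Q3"].value_counts(normalize=True)

# Bucket distance into bands, then take the share familiar within each band.
# Respondents with no distance fall outside every band and are left out.
bands = pd.cut(
    toy["distance_miles"],
    bins=[0, 5, 15, float("inf")],
    labels=["<5 mi", "5-15 mi", "15+ mi"],
    include_lowest=True,
)
familiar_by_band = (toy["Q3"] == "Yes").groupby(bands, observed=False).mean()
print(familiar_share, familiar_by_band, sep="\n\n")
```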

**Impression of the market (Q10)**
* Distribution (Very negative → Very positive).
* Compare impression vs:
    * Satisfaction metrics (Q13_1–Q13_5).
    * Visit frequency & last visit.
    * Typical spend.

# 3. Visit behavior & travel patterns

**How often do you visit? (Q8)**
* Bar chart of categories (weekly, monthly, rarely, etc.).
* Cross-tab with:
    * Familiarity.
    * Impression.
    * Age group.
    * Student vs non-student.
    * Travel mode.
    * Income.

**When was the last time you visited? (Q9)**
* Distribution (within week, month, year, never).
* Combine with frequency to spot inconsistencies (e.g., says “weekly” but last visit “more than a year ago”).
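
The consistency check above could be sketched as a pair of boolean masks. The answer labels are hypothetical; the survey's actual wording for `Q8` and `Q9` may differ.

```python
import pandas as pd

# Hypothetical answer labels; the survey's actual Q8/Q9 wording may differ.
toy = pd.DataFrame({
    "Q8": ["Weekly", "Weekly", "Monthly", "Rarely", "Never"],
    "Q9": ["Within the past week", "More than a year ago",
           "Within the past month", "More than a year ago",
           "Within the past week"],
})

# Stated frequency and last visit that contradict each other.
frequent = toy["Q8"].isin(["Weekly", "Monthly"])
long_ago = toy["Q9"].eq("More than a year ago")
never_but_visited = toy["Q8"].eq("Never") & toy["Q9"].ne("Never")

toy["inconsistent"] = (frequent & long_ago) | never_but_visited
print(toy[toy["inconsistent"]])
```

Flagged rows are candidates for review rather than automatic exclusion, since some apparent contradictions (a lapsed regular, for example) may be genuine.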

**How do you typically travel to the market? (Q7)**
* Mode split (car, walking, bus, bike, rideshare, etc.).
* Cross-tab with:
    * Distance (via ZIP or Lat/Long).
    * Age group.
    * Student vs non-student.
    * Household income.
    * Reasons for not visiting (parking, accessibility).

**Spend per visit (Q11)**
* Distribution (histogram / boxplot).
* Compare median/mean spend across:
    * Visit frequency.
    * Age group.
    * Student vs non-student.
    * Household income.
    * Impression and satisfaction.
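
One way to compare spend across groups, assuming `Q11` holds dollar amounts stored as text. If the survey recorded spend as ranges instead, each range would need mapping to a midpoint first. The data below is made up.

```python
import pandas as pd

# Hypothetical toy data; assumes Q11 is a dollar amount stored as text.
toy = pd.DataFrame({
    "Q8": ["Weekly", "Weekly", "Monthly", "Monthly", "Rarely"],
    "Q11": ["25", "35", "40", "n/a", "60"],
})

toy["spend"] = pd.to_numeric(toy["Q11"], errors="coerce")  # "n/a" becomes NaN

# Median, mean, and number of usable answers per visit-frequency group.
spend_by_freq = toy.groupby("Q8")["spend"].agg(["median", "mean", "count"])
print(spend_by_freq)
```

Reporting `count` alongside the averages makes it visible when a group's median rests on only a handful of responses.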

**Exploring vs going to specific vendors (Q12)**
* Share who “explores stalls” vs “goes straight to specific vendors”.
* Cross-tab with:
    * Visit frequency (do regulars beeline more?).
    * Spend per visit (explorers vs targeted shoppers).
    * Satisfaction & impression.

# 4. Satisfaction & Experience Inside the Market

**4.1 Satisfaction Distributions**
- Analyze distributions for:
  - Product variety  
  - Cleanliness  
  - Vendor interactions  
  - Parking / accessibility  
  - Value for money  
- Compute:
  - Mean, median, mode  
  - % satisfied vs neutral vs dissatisfied  
- Visuals:
  - Side-by-side bar charts  
  - Radar chart of satisfaction dimensions  
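
A sketch of the summary statistics above, assuming the satisfaction items use a five-point scale with the wording shown in `LIKERT`. Both the wording and the toy responses are assumptions.

```python
import pandas as pd

# Assumed five-point wording; adjust to the labels actually present in the data.
LIKERT = {
    "Extremely dissatisfied": 1,
    "Somewhat dissatisfied": 2,
    "Neither satisfied nor dissatisfied": 3,
    "Somewhat satisfied": 4,
    "Extremely satisfied": 5,
}

# Hypothetical toy responses for two of the five items.
toy = pd.DataFrame({
    "Q13_1": ["Extremely satisfied", "Somewhat satisfied",
              "Somewhat dissatisfied", "Extremely satisfied"],
    "Q13_4": ["Somewhat dissatisfied", "Neither satisfied nor dissatisfied",
              "Extremely dissatisfied", "Somewhat satisfied"],
})

scores = toy.apply(lambda col: col.map(LIKERT))  # text labels -> 1..5

# Percentages here use all rows as the denominator; the toy data has no gaps.
summary = pd.DataFrame({
    "mean": scores.mean(),
    "median": scores.median(),
    "pct_satisfied": (scores >= 4).mean() * 100,
    "pct_neutral": (scores == 3).mean() * 100,
    "pct_dissatisfied": (scores <= 2).mean() * 100,
})
print(summary)
```

With real data that contains unanswered items, dividing by `scores.notna().sum()` instead would keep missing responses out of the percentages.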

**4.2 Compare Satisfaction Dimensions**
- Identify which dimensions rank highest or lowest.
- Check if issues like cleanliness or parking consistently score lower.

**4.3 Satisfaction vs Demographics**
Compare each satisfaction metric across:
- Age groups  
- Gender  
- Race  
- Household income  
- Student vs non-student  
- Household size  

**4.4 Satisfaction vs Behavior**
Analyze satisfaction as a function of:
- Visit frequency  
- Spend per visit  
- Travel mode  
- Whether they explore stalls or visit specific vendors  
- Grocery delivery usage  

**4.5 Dissatisfaction Follow-Up (Open Text)**
For respondents who reported “somewhat/extremely dissatisfied”:
- Thematic coding of explanations  
- Identify top themes  
- Compare themes across demographics  
- Connect themes back to specific satisfaction dimensions
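
A keyword pass can give a rough first cut before manual thematic coding. The theme dictionary and the comments below are invented for illustration; real codes should come from reading the responses.

```python
import re
from collections import Counter

# Hypothetical theme dictionary: theme name -> regex of trigger words.
THEMES = {
    "parking": r"\bpark(ing)?\b|\bgarage\b",
    "cleanliness": r"\b(dirty|clean|smell|trash)\w*",
    "price": r"\b(price|expensive|cost|overpriced)\w*",
}

def code_themes(text):
    """Return the sorted list of themes whose pattern appears in the text."""
    text = (text or "").lower()
    return sorted(name for name, pat in THEMES.items() if re.search(pat, text))

# Made-up example comments, including an empty response.
comments = [
    "Parking is impossible on weekends",
    "Too expensive and the floors were dirty",
    "Loved it",
    None,
]
coded = [code_themes(c) for c in comments]
theme_counts = Counter(theme for themes in coded for theme in themes)
print(coded)
print(theme_counts.most_common())
```

Keyword matching flags the topic of a comment, not its sentiment ("very clean" would still land under cleanliness), so the output is best treated as a starting point for human review.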

# 5. Other Public Markets & Comparisons

**5.1 Visits to Other Markets**
- Proportion who have visited other markets  
- Breakdown of which markets (open text)  
- Compare visitors vs non-visitors on:
  - Satisfaction at WSM  
  - Visit frequency  
  - Spend per visit  
  - Event/program interest  

**5.2 Reasons for Shopping at Public Markets (Multi-Select)**
- % selecting each motivator  
- Group motivations into themes:
  - Freshness  
  - Community  
  - Cultural experience  
  - Price/value  
- Compare motivations by:
  - Age  
  - Student status  
  - Visit frequency  
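
For multi-select questions, the share selecting each option could be computed as below. This assumes each cell stores the chosen options as one comma-separated string; the column name `reasons` and the option labels are hypothetical.

```python
import pandas as pd

# Hypothetical toy column; assumes comma-separated selections per respondent.
toy = pd.DataFrame({
    "reasons": ["Freshness,Price", "Freshness", "Community,Freshness", None],
})

# One 0/1 indicator column per option; unanswered rows are dropped first.
answered = toy["reasons"].dropna()
indicators = answered.str.get_dummies(sep=",")

# Share of answering respondents who selected each option, highest first.
pct_selected = (indicators.mean() * 100).sort_values(ascending=False)
print(pct_selected.round(1))
```

The indicator columns can be joined back onto the main frame for cross-tabs by age, student status, or visit frequency. Option labels that themselves contain commas would need a different separator.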

**5.3 Likes/Dislikes About Other Markets (Open Text)**
- Thematic coding  
- Identify common features of successful markets  
- Compare criticisms of other markets with WSM’s issues  

# 6. Motivations & Barriers for the West Side Market

**6.1 Motivations (Multi-Select)**
- % selecting each reason for visiting  
- Compare motivations across:
  - Spend per visit  
  - Visit frequency  
  - Age  
  - Student vs non-student  

**6.1.1 Additional Motivations (Open Text)**
- Code unique motivations (e.g., history, atmosphere)

**6.2 Barriers (Multi-Select)**
- % selecting each barrier:
  - Parking  
  - Distance  
  - Hours  
  - Pricing  
  - Accessibility  
  - Convenience  
- Cross-tabs:
  - Frequent vs infrequent visitors  
  - Visitors vs non-visitors  
  - Travel mode  
  - Student status  
  - Household income  

**6.2.1 Additional Barriers (Open Text)**
- Identify themes not captured by predefined categories  

# OTHERS

# 7. Student-Focused Analysis

## 7.1 Student vs Non-Student
- Proportion of student respondents  
- Compare:
  - Visit frequency  
  - Typical spend  
  - Travel mode  
  - Event interest levels  
  - Volunteer/donor interest  
  - Grocery delivery usage  

## 7.2 Breakdown by School
- Distribution across schools (CWRU, CSU, Tri-C, etc.)
- Compare student subgroups by:
  - Visit patterns  
  - Event interest  
  - Spending behavior  

---

# 8. Programming & Event Interest

## 8.1 Interest Levels (Top-2-Box)
Calculate % “Very interested” + “Somewhat interested” for:
- Live music  
- Seasonal food festivals  
- Cooking/nutrition classes  
- Family-friendly events  
- Evening happy hours  
- Cultural celebrations  
- Loyalty/rewards program  
- Meal/grocery delivery  

Rank programs from most to least appealing.
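
A minimal sketch of the top-2-box calculation and ranking. The column names and toy responses are hypothetical; the two interest labels follow the wording above.

```python
import pandas as pd

# Hypothetical column names and toy responses.
toy = pd.DataFrame({
    "live_music": ["Very interested", "Somewhat interested",
                   "Not interested", "Very interested"],
    "cooking_classes": ["Not interested", "Somewhat interested",
                        "Not interested", None],
})

TOP_2 = {"Very interested", "Somewhat interested"}

# Top-2-box: share of non-missing answers in the two most favorable options,
# sorted so the most appealing program comes first.
top2 = pd.Series({
    col: toy[col].dropna().isin(TOP_2).mean() * 100 for col in toy.columns
}).sort_values(ascending=False)
print(top2.round(1))
```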

## 8.2 Segment-Level Insights
Compare interest levels by:
- Age group  
- Household size  
- Student vs non-student  
- Frequent vs infrequent visitors  

## 8.3 Event Interest Clustering
Identify clusters such as:
- Entertainment-focused  
- Education-focused  
- Convenience-focused  
- Family-oriented  

## 8.4 Event Suggestions (Open Text)
- Thematic coding of new ideas not provided in survey options  

---

# 9. Timing & Logistics

## 9.1 Best Days to Visit (Multi-Select)
- Count and rank preferred days of the week  
- Cross-analyze with:
  - Visit frequency  
  - Student status  
  - Work/school patterns  

---

# 10. Service & Product Experience

## 10.1 Vendor Interactions (Open Text)
- Classify as positive / neutral / negative  
- Identify recurring themes (friendliness, rudeness, quality, etc.)
- Compare themes across:
  - Demographics  
  - Satisfaction scores  

## 10.2 Product Experiences (Open Text)
- Categorize by:
  - Freshness  
  - Pricing  
  - Quality issues  
  - Unique finds  
- Link themes with satisfaction and spend levels  

## 10.3 Additional Comments (Open Text)
- Capture broader sentiments  
- Identify repeated suggestions or criticisms  

---

# 11. Grocery Shopping & Prepared Food Habits

## 11.1 Grocery Delivery Usage
- Frequency distribution  
- Compare:
  - Visit frequency  
  - Demographics  
  - Interest in meal/grocery delivery programs  

## 11.2 Grocery Store Choices (Multi-Select)
Compute % who shop at:
- Walmart  
- Aldi  
- Giant Eagle  
- Heinen’s  
- Dave’s  
- Farmers’ markets  
- West Side Market  
- Fairfax Market  

Compare patterns between:
- WSM shoppers vs non-WSM shoppers  
- Different demographic groups  

## 11.3 Prepared Food Sources (Multi-Select)
Analyze % using:
- Restaurants/takeout  
- DoorDash/Grubhub/etc.  
- Grocery store hot bars  
- WSM prepared foods  
- Meal prep services  

Cross-tab with:
- Event interest levels  
- Visit frequency  
- Age/student status  

---

# 12. Communication & Outreach

## 12.1 How Respondents Hear About Local Events
- % selecting each channel (social media, word of mouth, email, etc.)
- Analyze differences by:
  - Age  
  - Student status  
  - Event interest clusters  

## 12.2 Open Text (“Other” Sources)
- Code emerging channels (e.g., neighborhood Facebook groups, campus bulletin boards)

## 12.3 Email & Contact Permissions
- % who provided email  
- % willing to be contacted  
- Identify high-value segments:
  - Frequent visitors  
  - High satisfaction  
  - High event interest  
  - Potential volunteers/donors  

---

# 13. Demographic Analysis

## 13.1 Basic Distributions
Summaries for:
- Household size  
- Household income  
- Gender  
- Race  

## 13.2 Demographics vs Behavior
Compare demographic groups on:
- Visit frequency  
- Spend per visit  
- Satisfaction  
- Event interest  
- Grocery habits  

## 13.3 Intersectional Insights
Examples:
- Low-income frequent visitors vs high-income infrequent visitors  
- Students of color vs non-students  
- Larger households vs singles  

---

# 14. Advanced Exploratory Ideas

## 14.1 Factor Analysis / PCA
- Satisfaction items → latent factors  
- Event interest items → clusters of preferences  
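
As a rough illustration of the first bullet, PCA on standardized item scores can be done with NumPy's SVD. Note that this is PCA rather than a full factor-analysis model, and the score matrix below is invented.

```python
import numpy as np

# Hypothetical 1-5 satisfaction scores: rows are respondents, columns are items.
X = np.array([
    [5, 4, 5, 2, 4],
    [4, 4, 4, 1, 3],
    [2, 3, 2, 4, 2],
    [1, 2, 1, 5, 1],
    [3, 3, 3, 3, 3],
    [5, 5, 4, 2, 5],
], dtype=float)

# Standardize each item, then take the SVD of the standardized matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
U, S, Vt = np.linalg.svd(Z, full_matrices=False)

explained_ratio = S**2 / np.sum(S**2)      # variance share per component
loadings = Vt.T * S / np.sqrt(len(Z) - 1)  # item-by-component correlations
print(explained_ratio.round(3))
print(loadings[:, 0].round(2))
```

Items that load strongly on the same component move together across respondents, which is the pattern a latent "factor" is meant to capture. The sign of each component is arbitrary.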

## 14.2 Behavioral Segmentation / Clustering
Potential clusters:
- Local loyalists  
- Occasional tourists  
- Price-sensitive shoppers  
- Convenience-driven shoppers  

## 14.3 Predictive Explorations
(Still exploratory, not causal)
- Predict satisfaction using demographics + behavior  
- Predict event interest using shopping habits + visit frequency  