# CODTECH Internship – Task 1

## 📊 Big Data Analysis using PySpark

**Dataset**: [cities.csv](https://people.sc.fsu.edu/~jburkardt/data/csv/cities.csv)

### ✅ Steps Performed:
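- Loaded the dataset and cleaned the column names
- Renamed the `LatD` and `LonD` columns to `Latitude` and `Longitude`
- Displayed sample rows and the schema
- Counted the records and generated summary statistics
- Filtered by latitude and sorted by longitude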

---

## 🔍 Key Insights

1. The dataset contains city names and their geographic locations in Degrees, Minutes, and Seconds format.
2. Stripping stray spaces and quotes from the column names, and renaming `LatD` and `LonD`, made the columns easier to reference in queries.
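3. Cities with Latitude > 40 are mostly in the northern U.S. (e.g., Chicago). Because `LatD` holds only whole degrees, the filter keeps cities at 41° N and above, so a city at 40° N such as New York does not pass it.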
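4. Sorting by Longitude in descending order listed west coast cities such as San Francisco first, since the longitude values are in degrees west.
5. PySpark's DataFrame API handled the loading, cleaning, filtering, and sorting of this structured tabular data efficiently.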
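
Because the coordinates are split into degrees, minutes, and seconds, a comparison on the degree column alone ignores the fractional part. A minimal sketch of the usual conversion to signed decimal degrees, in plain Python, is shown here. The helper name `dms_to_decimal` and the sample coordinates are illustrative assumptions and are not part of the notebook.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds plus a hemisphere letter to signed decimal degrees.

    Hypothetical helper for illustration; the notebook itself does not define it.
    """
    value = degrees + minutes / 60 + seconds / 3600
    # South and West are negative by the usual convention.
    return -value if hemisphere.upper() in ("S", "W") else value


# Sample values chosen for illustration only (roughly San Francisco).
lat = dms_to_decimal(37, 46, 30, "N")
lon = dms_to_decimal(122, 25, 10, "W")
print(round(lat, 4), round(lon, 4))
```

With signed values like these, an ascending sort on longitude would list the westernmost cities first, which is the opposite of sorting the unsigned degree column.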

---
*Submitted by Vedika Gupta – Data Analytics*


In [None]:
from pyspark.sql import SparkSession

# ✅ Start Spark session
spark = SparkSession.builder.appName("CityDataAnalysis").getOrCreate()

In [None]:
# ✅ Load the dataset (make sure cities.csv is in C:/Users/Sanjay/)
df = spark.read.csv("file:///C:/Users/Sanjay/cities.csv", header=True, inferSchema=True)
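
The path above is tied to one Windows user directory. One possible way to avoid hard-coding it, sketched here with only the standard library, is to build the `file://` URI from a path relative to the working directory. The variable names, and the assumption that `cities.csv` sits in the current working directory, are illustrative and not from the original notebook.

```python
from pathlib import Path

# Assumption: cities.csv has been downloaded into the current working directory.
csv_path = Path("cities.csv").resolve()

# as_uri() requires an absolute path and yields a file:// URI,
# e.g. file:///C:/... on Windows or file:///home/... on Linux.
csv_uri = csv_path.as_uri()
print(csv_uri)
```

The resulting string could then be passed to `spark.read.csv` in place of the literal path.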

In [None]:
# ✅ Step 1: Show original column names
print("📌 Original column names:")
print(df.columns)

In [None]:
# ✅ Step 2: Clean up column names (remove extra spaces/quotes)
df = df.toDF(*[name.strip().replace('"', '') for name in df.columns])
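
To see what this cleanup does, the same expression can be run on a plain Python list. The raw names below are an assumed example of how a header with spaces after the commas and quoted fields might be parsed; check the output of `df.columns` for the actual values.

```python
# Assumed raw column names, for illustration only.
raw_columns = ['LatD', ' "LatM"', ' "LatS"', ' "NS"', ' "LonD"']

# Same cleanup as the notebook: trim whitespace, then drop double quotes.
clean_columns = [name.strip().replace('"', '') for name in raw_columns]
print(clean_columns)
```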

In [None]:
# ✅ Step 3: Rename the degree columns to readable names (they hold whole degrees only)
df = df.withColumnRenamed("LatD", "Latitude").withColumnRenamed("LonD", "Longitude")

In [None]:
# ✅ Step 4: Show the first few rows
print("📌 Preview of dataset:")
df.show(5)

In [None]:
# ✅ Step 5: Print schema
print("📌 Schema of dataset:")
df.printSchema()

In [None]:
# ✅ Step 6: Count the number of records
print("📌 Total number of cities:")
print(df.count())

In [None]:
# ✅ Step 7: Describe statistics
print("📌 Summary statistics:")
df.describe().show()

In [None]:
# ✅ Step 8: Filter cities with Latitude > 40
print("📌 Cities with Latitude > 40:")
df.filter(df["Latitude"] > 40).show()

In [None]:
# ✅ Step 9: Sort by Longitude (descending)
print("📌 Cities sorted by Longitude:")
df.orderBy("Longitude", ascending=False).show(5)