# CODTECH Internship – Task 1

## 📊 Big Data Analysis using PySpark

**Dataset**: [cities.csv](https://people.sc.fsu.edu/~jburkardt/data/csv/cities.csv)

### ✅ Steps Performed:
- Loaded and cleaned the dataset
- Renamed important columns
- Displayed sample rows and schema
- Generated basic statistics
- Filtered and sorted the data

---

## 🔍 Key Insights

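1. The dataset contains city names, state or province codes, and geographic coordinates in degrees, minutes, and seconds (DMS) format, with hemisphere letters in `NS` and `EW`.
2. Stripping stray spaces and quotes from the column names and renaming `LatD`/`LonD` to `Latitude`/`Longitude` made the columns easier to reference.
3. Cities with Latitude > 40 are mostly in the northern U.S. (e.g., Youngstown, OH and Yakima, WA), along with Canadian cities such as Winnipeg, MB.
4. Sorting by Longitude in descending order surfaced west coast cities such as Vancouver, BC; Salem, OR; and Tacoma, WA.
5. PySpark handled loading, cleaning, filtering, and sorting this small structured dataset (128 rows) in a few lines of code.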
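
Because the coordinates are stored as separate degree, minute, and second columns, a single decimal value is often easier to filter and sort on. The sketch below is plain Python rather than PySpark so it runs without a Spark session; the helper name `dms_to_decimal` is hypothetical, and the sign convention (South and West negative) is the usual one, not something the dataset itself specifies.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds plus a hemisphere letter to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # Assumed convention: South and West are negative.
    return -value if hemisphere.upper() in ("S", "W") else value


# Youngstown, OH, as shown in the dataset preview: 41° 5' 59" N, 80° 39' 0" W
lat = dms_to_decimal(41, 5, 59, "N")
lon = dms_to_decimal(80, 39, 0, "W")
print(round(lat, 4), round(lon, 4))
```

The same arithmetic could be applied to the `LatM`/`LatS` and `LonM`/`LonS` columns of the DataFrame; that step is not part of the original notebook.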

---


In [8]:
from pyspark.sql import SparkSession

# ✅ Start Spark session
spark = SparkSession.builder.appName("CityDataAnalysis").getOrCreate()

In [9]:
# ✅ Load the dataset (adjust the path below to wherever cities.csv is saved)
df = spark.read.csv("D:\\Project\\task1\\cities.csv", header=True, inferSchema=True)

In [10]:
# ✅ Step 1: Show original column names
print("📌 Original column names:")
print(df.columns)

📌 Original column names:
['LatD', ' "LatM"', ' "LatS"', ' "NS"', ' "LonD"', ' "LonM"', ' "LonS"', ' "EW"', ' "City"', ' "State"']


In [11]:
# ✅ Step 2: Clean up column names (remove extra spaces/quotes)
df = df.toDF(*[col.strip().replace('"', '') for col in df.columns])
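
To see what the clean-up expression does without starting Spark, the same list comprehension can be run on the column names printed in Step 1. This is a minimal plain-Python check; the variable names `raw_columns` and `cleaned` are illustrative only.

```python
# Column names exactly as printed in the "Original column names" output.
raw_columns = ['LatD', ' "LatM"', ' "LatS"', ' "NS"', ' "LonD"',
               ' "LonM"', ' "LonS"', ' "EW"', ' "City"', ' "State"']

# Same expression as in the cell: trim whitespace, then drop double quotes.
cleaned = [col.strip().replace('"', '') for col in raw_columns]
print(cleaned)
```

Note that this step only changes the column names; the values inside the columns are left untouched.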

In [12]:
# ✅ Step 3: Rename the degree columns to readable names (Latitude/Longitude hold whole degrees only)
df = df.withColumnRenamed("LatD", "Latitude").withColumnRenamed("LonD", "Longitude")

In [13]:
# ✅ Step 4: Show the first few rows
print("📌 Preview of dataset:")
df.show(5)

📌 Preview of dataset:
+--------+----+----+----+---------+----+----+----+------------------+-----+
|Latitude|LatM|LatS|  NS|Longitude|LonM|LonS|  EW|              City|State|
+--------+----+----+----+---------+----+----+----+------------------+-----+
|    41.0| 5.0|59.0| "N"|     80.0|39.0| 0.0| "W"|      "Youngstown"|   OH|
|    42.0|52.0|48.0| "N"|     97.0|23.0|23.0| "W"|         "Yankton"|   SD|
|    46.0|35.0|59.0| "N"|    120.0|30.0|36.0| "W"|          "Yakima"|   WA|
|    42.0|16.0|12.0| "N"|     71.0|48.0| 0.0| "W"|       "Worcester"|   MA|
|    43.0|37.0|48.0| "N"|     89.0|46.0|11.0| "W"| "Wisconsin Dells"|   WI|
+--------+----+----+----+---------+----+----+----+------------------+-----+
only showing top 5 rows
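
The preview shows that values in `NS`, `EW`, and `City` still carry their double quotes. A likely cause is a space after each comma in the file, which stops a CSV parser from treating the quote as the start of a quoted field. The sketch below illustrates the mechanism with Python's standard `csv` module as a stand-in for Spark; the sample line is reconstructed from the Youngstown preview row, and its exact spacing is an assumption.

```python
import csv
import io

# One line in a layout similar to the source file (spacing is assumed).
sample = '41, 5, 59, "N", 80, 39, 0, "W", "Youngstown", OH\n'

# Default parsing: the leading space makes the quote part of the value.
default_row = next(csv.reader(io.StringIO(sample)))

# Skipping the space after each delimiter lets the quotes be recognized.
trimmed_row = next(csv.reader(io.StringIO(sample), skipinitialspace=True))

print(repr(default_row[3]), repr(default_row[8]))
print(repr(trimmed_row[3]), repr(trimmed_row[8]))
```

Spark's CSV reader has an `ignoreLeadingWhiteSpace` option that may have a comparable effect when passed to `spark.read.csv`; that was not tried in this notebook, so treat it as something to verify rather than a confirmed fix.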


In [14]:
# ✅ Step 5: Print schema
print("📌 Schema of dataset:")
df.printSchema()

📌 Schema of dataset:
root
 |-- Latitude: double (nullable = true)
 |-- LatM: double (nullable = true)
 |-- LatS: double (nullable = true)
 |-- NS: string (nullable = true)
 |-- Longitude: double (nullable = true)
 |-- LonM: double (nullable = true)
 |-- LonS: double (nullable = true)
 |-- EW: string (nullable = true)
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)



In [15]:
# ✅ Step 6: Count the number of records
print("📌 Total number of cities:")
print(df.count())

📌 Total number of cities:
128


In [16]:
# ✅ Step 7: Describe statistics
print("📌 Summary statistics:")
df.describe().show()

📌 Summary statistics:
+-------+-----------------+-----------------+------------------+----+------------------+------------------+------------------+----+-------------+-----+
|summary|         Latitude|             LatM|              LatS|  NS|         Longitude|              LonM|              LonS|  EW|         City|State|
+-------+-----------------+-----------------+------------------+----+------------------+------------------+------------------+----+-------------+-----+
|  count|              128|              128|               128| 128|               128|               128|               128| 128|          128|  128|
|   mean|       38.8203125|        30.765625|        27.4921875|NULL|             93.25|        27.7421875|        26.9609375|NULL|         NULL| NULL|
| stddev|5.200595958808149|16.42615754139729|18.977813857217924|NULL|15.466499229793303|16.927937163000344|18.727806747477565|NULL|         NULL| NULL|
|    min|             26.0|              1.0|               0.0| "

In [17]:
# ✅ Step 8: Filter cities with Latitude > 40
print("📌 Cities with Latitude > 40:")
df.filter(df["Latitude"] > 40).show()

📌 Cities with Latitude > 40:
+--------+----+----+----+---------+----+----+----+------------------+-----+
|Latitude|LatM|LatS|  NS|Longitude|LonM|LonS|  EW|              City|State|
+--------+----+----+----+---------+----+----+----+------------------+-----+
|    41.0| 5.0|59.0| "N"|     80.0|39.0| 0.0| "W"|      "Youngstown"|   OH|
|    42.0|52.0|48.0| "N"|     97.0|23.0|23.0| "W"|         "Yankton"|   SD|
|    46.0|35.0|59.0| "N"|    120.0|30.0|36.0| "W"|          "Yakima"|   WA|
|    42.0|16.0|12.0| "N"|     71.0|48.0| 0.0| "W"|       "Worcester"|   MA|
|    43.0|37.0|48.0| "N"|     89.0|46.0|11.0| "W"| "Wisconsin Dells"|   WI|
|    49.0|52.0|48.0| "N"|     97.0| 9.0| 0.0| "W"|        "Winnipeg"|   MB|
|    48.0| 9.0| 0.0| "N"|    103.0|37.0|12.0| "W"|       "Williston"|   ND|
|    41.0|15.0| 0.0| "N"|     77.0| 0.0| 0.0| "W"|    "Williamsport"|   PA|
|    47.0|25.0|11.0| "N"|    120.0|19.0|11.0| "W"|       "Wenatchee"|   WA|
|    41.0|25.0|11.0| "N"|    122.0|23.0|23.0| "W"|         
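
Since `Latitude` holds only the whole-degree part, `Latitude > 40` keeps rows at 41° and above and drops any city between 40°00′01″ and 40°59′59″, even though those lie north of the 40th parallel. The sketch below shows the difference in plain Python on hypothetical rows (not taken from cities.csv), assuming northern-hemisphere, non-negative values.

```python
# Hypothetical rows as (degrees, minutes, seconds, name); not from cities.csv.
rows = [
    (40, 30, 0, "city_a"),   # north of 40° but whole degrees == 40
    (41, 5, 59, "city_b"),
    (39, 59, 59, "city_c"),
]

# Whole-degree test, analogous to comparing the degrees column with 40.
by_degrees = [name for d, m, s, name in rows if d > 40]

# Tuple comparison is lexicographic, so minutes and seconds are respected.
by_full_dms = [name for d, m, s, name in rows if (d, m, s) > (40, 0, 0)]

print(by_degrees)
print(by_full_dms)
```

Whether the stricter or the looser reading is wanted depends on the question being asked; the notebook uses the whole-degree comparison.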

In [18]:
# ✅ Step 9: Sort by Longitude (descending)
print("📌 Cities sorted by Longitude:")
df.orderBy("Longitude", ascending=False).show(5)

📌 Cities sorted by Longitude:
+--------+----+----+----+---------+----+----+----+-------------+-----+
|Latitude|LatM|LatS|  NS|Longitude|LonM|LonS|  EW|         City|State|
+--------+----+----+----+---------+----+----+----+-------------+-----+
|    49.0|16.0|12.0| "N"|    123.0| 7.0|12.0| "W"|  "Vancouver"|   BC|
|    44.0|56.0|23.0| "N"|    123.0| 1.0|47.0| "W"|      "Salem"|   OR|
|    41.0|25.0|11.0| "N"|    122.0|23.0|23.0| "W"|       "Weed"|   CA|
|    38.0|26.0|23.0| "N"|    122.0|43.0|12.0| "W"| "Santa Rosa"|   CA|
|    47.0|14.0|24.0| "N"|    122.0|25.0|48.0| "W"|     "Tacoma"|   WA|
+--------+----+----+----+---------+----+----+----+-------------+-----+
only showing top 5 rows
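
Sorting on the whole-degree `Longitude` column leaves ties (for example, the three rows at 122°) in no meaningful order. As a rough illustration, the sketch below re-ranks only the five rows shown above by their full degrees, minutes, and seconds, in plain Python; the helper name `west_longitude` is hypothetical, and other rows in the dataset that are not shown here could fall between these if the full dataset were re-sorted.

```python
# (LonD, LonM, LonS, City) for the five rows in the output above.
top_rows = [
    (123, 7, 12, "Vancouver"),
    (123, 1, 47, "Salem"),
    (122, 23, 23, "Weed"),
    (122, 43, 12, "Santa Rosa"),
    (122, 25, 48, "Tacoma"),
]


def west_longitude(row):
    """Degrees west as a decimal number (all five rows have EW = W)."""
    deg, minutes, seconds, _city = row
    return deg + minutes / 60 + seconds / 3600


# Largest value first, i.e. westernmost first.
ranked = sorted(top_rows, key=west_longitude, reverse=True)
for row in ranked:
    print(f"{row[3]:<12}{west_longitude(row):.4f}")
```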
