<a href="https://colab.research.google.com/github/vigneshpadala/E-commerce/blob/main/imdbtopmovies.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**1. pandas**

**1.1 Importing and Basic Setup**

In [None]:
from google.colab import files
uploaded = files.upload()

Saving imdb_top_movies.csv to imdb_top_movies.csv


In [None]:
import pandas as pd
import io
df = pd.read_csv(io.BytesIO(uploaded['imdb_top_movies.csv']))
df.head()        # First 5 rows (not displayed: a cell shows only its last expression)
df.tail()        # Last 5 rows

Unnamed: 0,Rank,Title,Year,Rating,Duration,IMDb URL,Image URL
245,246,The Grapes of Wrath,1940,8.1,2h 9m,https://www.imdb.com/title/tt0032551/,https://m.media-amazon.com/images/M/MV5BZDgzZj...
246,247,To Be or Not to Be,1942,8.1,1h 39m,https://www.imdb.com/title/tt0035446/,https://m.media-amazon.com/images/M/MV5BMTA5OT...
247,248,Gangs of Wasseypur,2012,8.2,5h 21m,https://www.imdb.com/title/tt1954470/,https://m.media-amazon.com/images/M/MV5BMTc5Nj...
248,249,Drishyam,2015,8.2,2h 43m,https://www.imdb.com/title/tt4430212/,https://m.media-amazon.com/images/M/MV5BM2NmMG...
249,250,The Help,2011,8.1,2h 26m,https://www.imdb.com/title/tt1454029/,https://m.media-amazon.com/images/M/MV5BMTM5OT...


**1.2 Basic Data Exploration**

In [None]:
df.info()         # Column names & data types

In [None]:
df.describe()     # Summary statistics for numeric columns

In [None]:
df.shape          # (rows, columns)

In [None]:
df.columns       # all columns

Index(['Rank', 'Title', 'Year', 'Rating', 'Duration', 'IMDb URL', 'Image URL'], dtype='object')

In [None]:
df.dtypes        # datatypes

**1.3 Select specific columns or rows**

In [None]:
df['IMDb URL']                 # Select one column

In [None]:
df[['Title', 'Year']]          # Select multiple columns

In [None]:
df.iloc[0]                     # First row (by position)

In [None]:
df.loc[100]                    # Row whose index label is 100

In [None]:
df.iloc[0:5]                   # Slice by position: rows 0 to 4 (end is excluded)
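The difference between `loc` and `iloc` is easiest to see when the index labels are not 0, 1, 2, and so on. The following is a minimal sketch on a hypothetical three-row frame (`demo` and its values are made up for illustration and are not part of the IMDb dataset):

```python
import pandas as pd

# Hypothetical frame whose index labels differ from the row positions
demo = pd.DataFrame(
    {"Title": ["A", "B", "C"], "Rating": [9.0, 8.5, 8.1]},
    index=[100, 200, 300],
)

first_by_position = demo.iloc[0]   # position 0 -> the row labelled 100
row_by_label = demo.loc[200]       # label 200 -> the second row

# iloc slices exclude the end position; loc slices include the end label
by_position = demo.iloc[0:2]       # rows at positions 0 and 1
by_label = demo.loc[100:200]       # rows labelled 100 and 200
```

With the default index that `read_csv` creates, labels and positions coincide, so `df.loc[100]` and `df.iloc[100]` return the same row.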

**1.4 Filter data**

Filtering in pandas means selecting the subset of a DataFrame's rows or columns that meets specific conditions. It is a fundamental operation in data analysis because it lets you focus on the relevant portion of your data.

**There are two primary ways to filter data in pandas:**

Filtering by labels (column names or index labels):

1. This method uses the DataFrame.filter() function.

2. It selects columns or rows by their labels (names or index values), not by the values stored in them.

3. filter() accepts items (exact label matches), like (substring matches), and regex (regular expression patterns). These three options are mutually exclusive.

4. The axis parameter sets whether rows (axis=0 or 'index') or columns (axis=1 or 'columns') are filtered. For a DataFrame the default is columns.

Filtering by values (boolean indexing):

1. A comparison such as df['Rating'] >= 8.0 produces a Series of True/False values, and df[...] keeps the rows where it is True. The cells in the rest of this section use this method.
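As an illustration of label-based filtering, here is a minimal sketch of `DataFrame.filter()` on a small hypothetical frame. The column names mirror the dataset, but `demo` and its `example.com` URLs are placeholders invented for this example:

```python
import pandas as pd

# Hypothetical two-row frame with the same kind of column labels as the dataset
demo = pd.DataFrame(
    {
        "Title": ["Inception", "Room"],
        "Year": [2010, 2015],
        "IMDb URL": ["https://example.com/a", "https://example.com/b"],
        "Image URL": ["https://example.com/a.jpg", "https://example.com/b.jpg"],
    }
)

exact = demo.filter(items=["Title", "Year"])   # exact column labels
partial = demo.filter(like="URL")              # column labels containing "URL"
pattern = demo.filter(regex=r"^I")             # column labels starting with "I"
rows = demo.filter(items=[0], axis=0)          # row whose index label is 0
```

Note that `filter()` never looks at cell contents. Selecting rows by what they contain is done with a boolean condition instead.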


In [None]:
high_rated = df[df['Rating'] >= 8.0]
print(high_rated[['Title', 'Rating']])

In [None]:
new_movies = df[df['Year'] > 2015]
print(new_movies[['Title', 'Year']])

**1.5 Movies from specific years**

In [None]:
selected_years = df[df['Year'].isin([2010, 2012, 2015])]
print(selected_years[['Title', 'Year']])

                        Title  Year
13                  Inception  2010
51           Django Unchained  2012
70      The Dark Knight Rises  2012
92                Toy Story 3  2010
93                   The Hunt  2012
96                  Incendies  2010
135            Shutter Island  2010
170                Inside Out  2015
184        Mad Max: Fury Road  2015
198  How to Train Your Dragon  2010
218                 Spotlight  2015
225                      Room  2015
247        Gangs of Wasseypur  2012
248                  Drishyam  2015


**1.6 Movies whose title starts with "The"**

In [None]:
the_movies = df[df['Title'].str.startswith("The ")]   # trailing space avoids matching titles such as "There ..."
print(the_movies[['Title']])

**1.7 Movies with missing IMDb URL**

In [None]:
missing_url = df[df['IMDb URL'].isnull()]
print(missing_url)

**1.8 Movies NOT from year 2020**

In [None]:
not_2020 = df[df['Year'] != 2020]
print(not_2020[['Title', 'Year']])

**1.9 Movies from year 2020**

In [None]:
movies_2020 = df[df['Year'] == 2020]
print(movies_2020[['Title', 'Year']])
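The single-condition filters in this section can also be combined. The sketch below assumes a small hypothetical frame (`demo`, with illustrative ratings rather than values read from the CSV) and shows the `&`, `|`, and `~` operators together with `between()`:

```python
import pandas as pd

# Hypothetical sample; the ratings are illustrative, not taken from the dataset
demo = pd.DataFrame(
    {
        "Title": ["Inception", "The Hunt", "Room", "The Help"],
        "Year": [2010, 2012, 2015, 2011],
        "Rating": [8.8, 8.3, 8.1, 8.1],
    }
)

starts_with_the = demo["Title"].str.startswith("The ")

# Wrap each comparison in parentheses: & is "and", | is "or", ~ is "not"
recent_and_high = demo[(demo["Year"] >= 2012) & (demo["Rating"] >= 8.2)]
the_or_2015 = demo[starts_with_the | (demo["Year"] == 2015)]
not_the = demo[~starts_with_the]

# between() is inclusive on both ends by default
in_range = demo[demo["Year"].between(2011, 2012)]
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators in Python. The keywords `and` and `or` do not work element-wise on a Series.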

**2. matplotlib**

**2.1 Import matplotlib libraries**

In [None]:
import matplotlib.pyplot as plt

**2.2 Basic Matplotlib Operations**

In [None]:
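# Average rating per release year (the original cell set labels but never drew a plot)
avg_by_year = df.groupby('Year')['Rating'].mean()
plt.figure(figsize=(10, 6))
plt.plot(avg_by_year.index, avg_by_year.values, marker='o')
plt.xlabel('Year')
plt.ylabel('Average Rating')
plt.title('Average IMDb Rating by Year')
plt.grid(True)
plt.show()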

**2.3 Scatter plot — Duration vs Rating**

In [None]:
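# Duration is stored as text such as "2h 9m"; convert it to minutes before plotting
parts = df['Duration'].str.extract(r'(?:(\d+)h)?\s*(?:(\d+)m)?').astype(float).fillna(0)
duration_min = parts[0] * 60 + parts[1]

plt.figure(figsize=(8,5))
plt.scatter(duration_min, df['Rating'], color='blue', alpha=0.6)
plt.xlabel('Duration (minutes)')
plt.ylabel('Rating')
plt.title('Movie Duration vs Rating')
plt.show()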

**2.4 Histogram — Distribution of ratings**

In [None]:
plt.figure(figsize=(8,5))
plt.hist(df['Rating'], bins=10, color='orange', edgecolor='black')
plt.xlabel('Rating')
plt.ylabel('Number of Movies')
plt.title('Distribution of Movie Ratings')
plt.show()


**2.5 Pie chart — Movies per decade**

In [None]:
df['Decade'] = (df['Year'] // 10) * 10
decade_counts = df['Decade'].value_counts()

plt.figure(figsize=(6,6))
plt.pie(decade_counts, labels=decade_counts.index, autopct='%1.1f%%', startangle=90)
plt.title('Movies by Decade')
plt.show()
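`value_counts()` orders its result by frequency, so the pie slices above appear from the most common decade to the least common. If chronological order is preferred, one option is to sort by the index first. This is a minimal sketch on a hypothetical list of years (`years` is made up for illustration):

```python
import pandas as pd

# Hypothetical release years, not read from the CSV
years = pd.Series([1994, 1999, 2008, 2010, 2012, 1972])

decades = (years // 10) * 10                          # 1994 -> 1990
decade_counts = decades.value_counts().sort_index()   # chronological order
labels = [f"{d}s" for d in decade_counts.index]       # e.g. "1990s"
```

Passing `decade_counts` and `labels` to `plt.pie` would then draw the slices oldest decade first.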