<a href="https://colab.research.google.com/github/zsolthavanna/debreceninspace/blob/main/nlp_project_anime.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Setting everything up

In [7]:
!pip install nltk
!pip install scikit-learn




Preparing the data

In [8]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

#download NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
#necessary data for tokenization
nltk.download('punkt_tab')

#loading dataset
#note: passing names= without header=0 keeps the CSV's own header line as data row 0
df = pd.read_csv('/content/sample_data/anime_with_synopsis.csv', names=['ID', 'Title', 'Rating', 'Genres', 'Synopsis'])

#build the lemmatizer and stopword set once instead of on every call
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

#preprocess text function
def preprocess_text(text):
    #convert non-string values to strings (a missing synopsis, NaN, becomes 'nan')
    if not isinstance(text, str):
        text = str(text)
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text)  #remove special characters and punctuation
    tokens = nltk.word_tokenize(text)  #tokenization
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return " ".join(tokens)

#preprocessing the synopsis
df['Cleaned_Synopsis'] = df['Synopsis'].apply(preprocess_text)
df.head()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Unnamed: 0,ID,Title,Rating,Genres,Synopsis,Cleaned_Synopsis
0,MAL_ID,Name,Score,Genres,sypnopsis,sypnopsis
1,1,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space","In the year 2071, humanity has colonized sever...",year humanity colonized several planet moon so...
2,5,Cowboy Bebop: Tengoku no Tobira,8.39,"Action, Drama, Mystery, Sci-Fi, Space","other day, another bounty—such is the life of ...",day another bountysuch life often unlucky crew...
3,6,Trigun,8.24,"Action, Sci-Fi, Adventure, Comedy, Drama, Shounen","Vash the Stampede is the man with a $$60,000,0...",vash stampede man bounty head reason he mercil...
4,7,Witch Hunter Robin,7.27,"Action, Mystery, Police, Supernatural, Drama, ...",ches are individuals with special powers like ...,ches individual special power like esp telekin...
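
The cleaned text in row 2 contains `bountysuch`: the regex deletes the dash in "bounty—such" without leaving a space, so the two words fuse into one token. A minimal sketch of one possible alternative, assuming it is acceptable to split words at punctuation, is to substitute a space instead of the empty string. The helper name `strip_non_letters` and the sample sentence are illustrative and not part of the notebook.

```python
import re

def strip_non_letters(text, replacement):
    """Lowercase text and replace every character that is not a letter or whitespace."""
    cleaned = re.sub(r'[^a-z\s]', replacement, text.lower())
    return " ".join(cleaned.split())  # collapse runs of whitespace

sample = "Another bounty—such is the life"
fused = strip_non_letters(sample, '')   # behaviour of the notebook's regex: words fuse
split = strip_non_letters(sample, ' ')  # alternative: words stay separate
print(fused)
print(split)
```

The trade-off is that contractions such as "he's" are then split into "he" and "s" rather than joined into "hes", so which variant is preferable depends on the text.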


To calculate similarity, convert the cleaned synopses into vectors using TF-IDF:


In [10]:
#converting the synopsis into TF-IDF vectors
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df['Cleaned_Synopsis'])

#checking the shape of the resulting matrix
print(f"TF-IDF matrix shape: {tfidf_matrix.shape}")


TF-IDF matrix shape: (16215, 48530)
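
The shape reads as one row per dataframe row (16215) and one column per distinct term (48530). To make the weights less of a black box, here is a small pure-Python sketch of the calculation, assuming the `TfidfVectorizer` defaults of smoothed IDF, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalisation of each row. The function `tfidf_rows` and the three toy documents are hypothetical, and the sketch tokenizes on whitespace only, which is simpler than scikit-learn's default token pattern.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) with L2-normalised TF-IDF weights per document."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

docs = ["space bounty hunter", "space pirate", "school romance"]
vocab, rows = tfidf_rows(docs)
print(len(rows), len(vocab))  # rows = documents, columns = distinct terms
```

In the first toy document, "space" appears in two documents and "bounty" in one, so "bounty" receives the larger weight.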


Use cosine similarity to measure the similarity between anime synopses:




In [11]:
#compute the cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

#function to recommend similar anime
def recommend_anime(title, cosine_sim=cosine_sim):
    matches = df.index[df['Title'] == title].tolist()  #find index of anime
    if not matches:
        raise ValueError(f"Title not found: {title!r}")
    idx = matches[0]
    sim_scores = list(enumerate(cosine_sim[idx]))  #get similarity scores
    sim_scores = [s for s in sim_scores if s[0] != idx]  #exclude the anime itself
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)  #sort by similarity
    sim_scores = sim_scores[:5]  #top 5 similar anime
    anime_indices = [i[0] for i in sim_scores]
    return df['Title'].iloc[anime_indices]


print("Recommendations for 'Cowboy Bebop':")
print(recommend_anime("Cowboy Bebop"))


Recommendations for 'Cowboy Bebop':
2        Cowboy Bebop: Tengoku no Tobira
3150                Ginga Senpuu Braiger
12388                    Chibikko Cowboy
3045     Cowboy Bebop: Yose Atsume Blues
8096                   Be-Bop Kaizokuban
Name: Title, dtype: object
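
`cosine_similarity(tfidf_matrix, tfidf_matrix)` returns a dense 16215 × 16215 array, roughly 2.1 GB as 64-bit floats, although a recommendation only needs one row of it. The following pure-Python sketch shows the underlying computation for a single query row and picks the top matches with `heapq.nlargest` instead of sorting every score. The names `cosine` and `top_similar` and the toy vectors are hypothetical.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_similar(rows, idx, k=5):
    """Indices of the k rows most similar to rows[idx], excluding idx itself."""
    scores = ((cosine(rows[idx], row), i) for i, row in enumerate(rows) if i != idx)
    return [i for _, i in heapq.nlargest(k, scores)]

rows = [
    [1.0, 1.0, 0.0],  # query
    [1.0, 0.9, 0.0],  # nearly the same direction
    [0.0, 0.0, 1.0],  # orthogonal to the query
    [1.0, 0.0, 0.0],  # partial overlap
]
print(top_similar(rows, 0, k=2))
```

With scikit-learn the same idea would be to pass only the query row, for example `cosine_similarity(tfidf_matrix[idx], tfidf_matrix)`, which yields a single row of scores.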


Installing the packages for the UI

In [12]:
!pip install streamlit
!pip install pyngrok




Writing an app with some changes to the recommendation system


In [13]:
%%writefile app.py
import streamlit as st
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

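#download necessary NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('punkt_tab')  #tokenizer data used by word_tokenize in recent NLTK releases

#loading dataset; header=0 replaces the CSV's own header row instead of reading it as data
df = pd.read_csv('/content/sample_data/anime_with_synopsis.csv', header=0, names=['ID', 'Title', 'Rating', 'Genres', 'Synopsis'])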

#build the lemmatizer and stopword set once instead of on every call
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

#preprocess text function
def preprocess_text(text):
    #convert non-string values to strings
    if not isinstance(text, str):
        text = str(text)
    text = text.lower()  #lowercasing
    text = re.sub(r'[^a-zA-Z\s]', '', text)  #remove special characters and punctuation
    tokens = nltk.word_tokenize(text)  #tokenization
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return " ".join(tokens)

#combining genres and synopsis for better matching
df['Combined_Text'] = df['Synopsis'].fillna('') + " " + df['Genres'].fillna('')

#apply preprocessing to the combined text
df['Cleaned_Text'] = df['Combined_Text'].apply(preprocess_text)

#convert the combined text into TF-IDF vectors
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df['Cleaned_Text'])

#compute the cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

#function to recommend anime with filtering
def recommend_anime(title, cosine_sim=cosine_sim, threshold=0.8):
    query = title.strip().lower()
    matches = df.index[df['Title'].str.lower() == query].tolist()
    if not matches:
        return []  #return empty list if title not found
    idx = matches[0]  #index of the first matching anime
    genres = df['Genres'].fillna('')  #treat missing genres as empty
    input_genres = set(genres.iloc[idx].split(", ")) - {''}  #genres of input anime

    sim_scores = list(enumerate(cosine_sim[idx]))  #get similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)  #sort by similarity
    #drop scores at or above the threshold (the anime itself and near-duplicates)
    sim_scores = [score for score in sim_scores if score[1] < threshold]

    recommendations = []
    for i, score in sim_scores:
        candidate_title = str(df['Title'].iloc[i])
        candidate_genres = set(genres.iloc[i].split(", "))

        #filter out sequels or alternate titles that contain the input title
        if query in candidate_title.lower():
            continue
        #ensure overlapping genres for contextual relevance
        if input_genres & candidate_genres:
            recommendations.append((candidate_title, score, genres.iloc[i]))
        if len(recommendations) >= 5:  #limit to top 5 recommendations
            break

    return recommendations

#streamlit UI
st.title("Anime Recommendation System")

anime_title = st.text_input("Enter an anime title:")
if st.button("Recommend"):
    recommendations = recommend_anime(anime_title)
    if recommendations:
        st.write("Recommended Anime:")
        for rec in recommendations:
            st.write(f"- **{rec[0]}** (Genres: {rec[2]}, Similarity: {rec[1]:.2f})")
    else:
        st.write("Sorry, no recommendations found. Please check the title and try again.")


Overwriting app.py
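
The app's `recommend_anime` applies three filters in order: a similarity ceiling (`threshold`), a title-containment check that drops sequels and alternate titles, and a shared-genre requirement. The sketch below isolates that logic on plain tuples so the effect of each filter is easy to see. The function `filter_candidates`, the similarity scores and the title "Example Romance" are made up for illustration and do not come from the dataset.

```python
def filter_candidates(query_title, query_genres, ranked, threshold=0.8, limit=5):
    """Apply the three filters to (title, score, genres) tuples sorted by score."""
    wanted = set(query_genres.split(", "))
    kept = []
    for title, score, genres in ranked:
        if score >= threshold:                    # too close: likely the same work
            continue
        if query_title.lower() in title.lower():  # sequel or alternate title
            continue
        if not wanted & set(genres.split(", ")):  # no shared genre
            continue
        kept.append((title, score, genres))
        if len(kept) == limit:
            break
    return kept

# hypothetical scores, already sorted from most to least similar
ranked = [
    ("Cowboy Bebop", 1.00, "Action, Sci-Fi"),
    ("Cowboy Bebop: Tengoku no Tobira", 0.45, "Action, Sci-Fi"),
    ("Trigun", 0.30, "Action, Comedy"),
    ("Example Romance", 0.25, "Romance"),
]
print(filter_candidates("Cowboy Bebop", "Action, Sci-Fi", ranked))
```

Only "Trigun" survives here: the first entry exceeds the threshold, the second contains the query title, and the last shares no genre with the query.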


ngrok token

In [14]:
#replace the placeholder with your own token; avoid committing a real token to a public notebook
!ngrok authtoken YOUR_NGROK_AUTHTOKEN

Authtoken saved to configuration file: /root/.config/ngrok/ngrok.yml


Set up an ngrok tunnel to the Streamlit app

In [15]:
from pyngrok import ngrok, conf

#configure tunnel with HTTP/2
conf.get_default().http2_tunnel = True

#set up a tunnel to the streamlit app
public_url = ngrok.connect(addr="8501")
print(f"Streamlit app is live at: {public_url}")

Streamlit app is live at: NgrokTunnel: "https://a32c-104-154-132-255.ngrok-free.app" -> "http://localhost:8501"
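
The tunnel forwards to `localhost:8501`, so requests through the public URL only succeed once something is listening on that port. A small readiness check along the following lines could be run after the app has been started; it is a sketch using only the standard library, and the helper name `wait_for_port` and its default values are assumptions rather than anything from the notebook.

```python
import socket
import time

def wait_for_port(port, host="127.0.0.1", timeout=30.0, interval=0.5):
    """Return True once host:port accepts TCP connections, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # a successful connect means a server is listening on the port
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)  # not ready yet, retry until the deadline
    return False
```

For example, `wait_for_port(8501)` would block for up to 30 seconds and report whether the Streamlit server came up.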


Running the Streamlit app in the background

In [18]:
!streamlit run app.py &


Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.

  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://172.28.0.12:8501
  External URL: http://104.154.132.255:8501

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!