# Extracting lists to Disavow links on Gooogle Search Console

My website ([LucidGen.com](https://lucidgen.com)) is getting many backlinks of unknown origin. Someone is attacking my website by giving me many bad backlinks. These things can badly affect my website ranking on Google Search. So I need to find bad backlinks in all of the backlinks to tell with Google that is not mine and require Google to exclude it from counting backlinks for my website.

**How I do it:**
I'll export data from Ahref (backlinks), MOZ (Spam Score), Search Console (Top linking sites) and combine them into one dataset.
I also gather some lists for conditions such as Domain White List, Posts Sitemap, Bad Keywords for Domain, and Bad Keywords for Path.
I'll extract Manual Check by use conditions:
- Domain and path not in bad keywords, path not contain my post's path
- Domain Authority (DA) > 15 and Spam Score (SS) < DA*20% < 10%
- Domain Rating (DR) > 30 and Domain Traffic (DT) > 3000

**Summary of results:**
I get three final lists:
- **Disavow list:** I'll send it to Google Search Console.
- **DMCA Report list:** I'll report to Google Copyright Removal because they infringe my content.
- **Manual Check List:** I need to check because they don't match any conditions. I can use Ahref Quick Batch Analysis to find their DA and add bad backlinks to Disavow list.


## Exploring datasets & lists

### Exploring datasets

#### Ahref dataset
I'll export it from Ahref > Site Explorer > Backlinks.

In [1]:
# Exploring Ahref dataset
import pandas as pd
explore_dataset_ahref = pd.read_csv('Input/Dataset_Ahref.csv')
explore_dataset_ahref

Unnamed: 0,Referring page title,Referring page URL,Language,Platform,Referring page HTTP code,Domain rating,Domain traffic,Referring domains,Linked domains,External links,...,Content,Nofollow,UGC,Sponsored,Rendered,Raw,Lost status,First seen,Last seen,Lost
0,WordPress là gì? Vì sao nên dùng WordPress để ...,https://thachpham.com/wordpress/wordpress-tuto...,vi,wordpress,200,67.0,35044,24,15,22,...,False,False,False,False,False,True,,2021-07-18 01:18:57,2021-12-02 06:57:36,
1,Học làm website với WordPress cơ bản,https://thachpham.com/wordpress/wordpress-tuto...,,wordpress,200,67.0,35044,55,12,19,...,False,False,False,False,False,True,,2021-07-22 09:25:45,2021-11-24 14:09:50,
2,Thach Pham – Chuyên trang chia sẻ các kiến thứ...,https://thachpham.com/,vi,wordpress,200,67.0,35044,560,11,14,...,False,False,False,False,False,True,,2021-10-11 09:11:28,2021-11-16 18:57:31,
3,Cài website WordPress trên localhost dùng XAMPP,https://thachpham.com/wordpress/wordpress-tuto...,vi,wordpress,200,67.0,35044,30,12,17,...,False,False,False,False,False,True,,2021-07-17 20:20:22,2021-12-03 07:32:02,
4,"Hướng dẫn cách Viết Chữ in đậm, in nghiêng ...",https://webhoanggia.com/huong-da%CC%83n-cach-v...,vi,"ecommerce, wordpress",200,25.0,1110,0,15,23,...,True,False,False,False,False,True,,2021-09-13 13:20:15,2021-11-30 05:53:41,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8936,Verkkosivun katsaus | lucidgen.com,https://www.website-scan.com/fi/www/lucidgen.com,fi,,200,28.0,5023,0,10,65,...,False,True,False,False,False,True,,2021-07-06 08:39:12,2021-10-10 01:35:52,
8937,7 themes WordPress trả phí có giá cao nhất,https://thachpham.com/wordpress/themes-wordpre...,vi,wordpress,200,67.0,35044,0,25,44,...,False,False,False,False,False,True,,2021-07-19 03:30:44,2021-12-01 19:12:38,
8938,Hiểu và sử dụng tốt sức mạnh của Link nội để l...,https://thachpham.com/seo/hieu-va-su-dung-tot-...,vi,wordpress,200,67.0,35044,0,13,24,...,False,False,False,False,False,True,,2021-07-20 02:40:42,2021-11-29 04:47:31,
8939,Merge Multiple Csv Files Python,https://www.bestschoolrankings.com/school/merg...,en,,200,2.8,6185,0,23,33,...,False,True,False,False,False,True,,2021-12-01 14:39:51,2021-12-01 14:39:51,


I'll use `Referring page URL` and `Domain Rating` from Ahref dataset.

#### Google Seach Console dataset
I'll export it from Search Console > Links > Top linking sites.

In [2]:
# Exploring Console dataset
explore_dataset_console = pd.read_csv('Input/Dataset_Console.csv')
explore_dataset_console

Unnamed: 0,Site,Linking pages,Target pages
0,huuthuan.net,9534,1
1,thachpham.com,4929,1
2,medium.com,1000,108
3,alotoi.com,490,1
4,linkhay.com,226,95
...,...,...,...
281,meohayplus.com,1,1
282,sketchfab.com,1,1
283,loginguides.net,1,1
284,logincollector.uk,1,1


I'll use only `Site` from Console dataset.

#### MOZ dataset
I'll export it from Moz Pro > Link Research > Spam Score.
Because the MOZ dataset has many redundant rows, I have to use another way to view, and I will delete these rows later.

In [3]:
#Exploring Moz dataset
from csv import reader
open_file = open('Input/Dataset_Moz.csv')
read_file = reader(open_file)
explore_dataset_moz = list(read_file)

for row in explore_dataset_moz[0:10]:
    print(row)

['\ufeff------------------------------------------------------------------------']
['Inbound Links by Spam Score for', 'lucidgen.com']
['Scope', 'pld']
['Sorted by', 'spam_score']
['------------------------------------------------------------------------']
[]
['URL', 'Title', 'Anchor Text', 'Spam Score', 'DA', 'Date First Seen', 'Date Crawled']
['www.10k.pw/domain-list-511', '', 'lucidgen.com', '96%', '3', '2021-07-27', '2021-10-30']
['www.100k.pw/domain-list-511', '', 'lucidgen.com', '96%', '2', '2021-07-25', '2021-10-24']
['www.new.net.in/domain-list-58', 'Alexa top domain list ||  page  58', 'lucidgen.com', '88%', '20', '2021-05-10', '2021-09-24']


I'll use `URL`, `Spam Score`, and `DA` from MOZ dataset.

### Exploring the list of uses for the condition

I'll prepare some simple lists with one column and no header:
- **Domain White List:** I use it to exclude the domains from conditions.
- **Posts Sitemap:** I export it from the website sitemap.
- **Bad Keywords for Domain:** For example, top, free, list, domain, .pw, etc.
- **Bad keywords for Path:** For example, list, top, web-site-no, etc.

In [4]:
#Exploring the list of uses for the condition.
list_whitelist = pd.read_csv('Input/List_DomainWhitelist.csv',header=None).values.tolist()
list_sitemap = pd.read_csv('Input/List_PostsSitemap.csv',header=None).values.tolist()
list_key_domain = pd.read_csv('Input/List_BadKeyDomain.csv',header=None).values.tolist()
list_key_path = pd.read_csv('Input/List_BadKeyPath.csv',header=None).values.tolist()

print('Domain whitelist:')
print(pd.DataFrame(list_whitelist))
print('\nOur post sitemap:')
print(pd.DataFrame(list_sitemap))
print('\nBad keywords for domain:')
print(pd.DataFrame(list_key_domain))
print('\nBad keywords for path:')
print(pd.DataFrame(list_key_path))

Domain whitelist:
                      0
0              about.me
1    analyticsmania.com
2           behance.net
3         business.site
4     cellphones.com.vn
..                  ...
696    knowyourmeme.com
697           houzz.com
698             moz.com
699        academia.edu
700        stanford.edu

[701 rows x 1 columns]

Our post sitemap:
                                                     0
0    https://lucidgen.com/en/how-to-add-watermark-t...
1    https://lucidgen.com/cach-tat-che-do-da-xem-va...
2    https://lucidgen.com/khang-trang-khong-duoc-ph...
3    https://lucidgen.com/ban-tai-khoan-quang-cao-f...
4    https://lucidgen.com/auto-phan-mem-ket-ban-fac...
..                                                 ...
195  https://lucidgen.com/cai-mail-cong-ty-vao-outl...
196  https://lucidgen.com/gia-lap-may-tinh-casio-fx...
197       https://lucidgen.com/co-nen-mua-ten-mien-vn/
198  https://lucidgen.com/loi-lien-ket-ban-theo-doi...
199   https://lucidgen.com/chuyen-ladipage-ve-

## Converting the datasets to a list of lists

In [5]:
# Converting the datasets to a list of lists
dataset_ahref = explore_dataset_ahref.values.tolist()
dataset_console = explore_dataset_console.values.tolist()
dataset_moz = explore_dataset_moz[7:]

print('First data row in Ahref:')
print(dataset_ahref[0])
print('\nFirst data row in Console:')
print(dataset_console[0])
print('\nFirst data row in Moz:')
print(dataset_moz[0])

First data row in Ahref:
['WordPress là gì? Vì sao nên dùng WordPress để làm website?', 'https://thachpham.com/wordpress/wordpress-tutorials/wordpress-la-gi-va-gioi-thieu.html', 'vi', 'wordpress', 200, 67.0, 35044, 24, 15, 22, 2187, 60, 'https://lucidgen.com/', nan, 'Lucid Gen', nan, 'text', False, False, False, False, False, True, nan, '2021-07-18 01:18:57', '2021-12-02 06:57:36', nan]

First data row in Console:
['huuthuan.net', 9534, 1]

First data row in Moz:
['www.10k.pw/domain-list-511', '', 'lucidgen.com', '96%', '3', '2021-07-27', '2021-10-30']


## Extracting the required data for datasets

I need to combine all of the datasets into one with columns: Domain, URL, Path, Domain Rating (DR), Domain Authority (DA), Spam Score (SS). So I need to extract some required columns from the datasets.
- **Ahref:** Domain, URL, Path, DR.
- **MOZ:** Domain, URL, Path, DA, SS.
- **Console:** Domain.

And to use for extract conditions, I also extract paths from the posts sitemap.

In [6]:
# For extracting TLD domain (Root domain)
import tldextract #We need install tldextract module before import it
# For extracting path in URL
from urllib.parse import urlparse

# Extracting the required data for Ahref dataset
dataset_ahref_extracted = []

for row in dataset_ahref:
    page = row[1]
    domain = tldextract.extract(page).registered_domain
    dr = float(row[5])
    dt = int(row[6])
    path = urlparse(page).path
    url = urlparse(page).netloc + urlparse(page).path # Remove HTTP/HTTPS to combine datasets later
    data = [domain,url,path,dr,dt]
    if data not in dataset_ahref_extracted and domain != '':
        dataset_ahref_extracted.append(data)

# Extracting the required data for Moz dataset
dataset_moz_extracted = []

for row in dataset_moz:
    page = 'https://'+row[0] # urlparse requires HTTP/HTTPS for URL, but MOZ doesn't have.
    domain = tldextract.extract(page).registered_domain
    da = float(row[4])
    ss = row[3].replace('%','')
    if ss == '--':
        ss = 0.0
    else:
        ss = float(ss)
    path = urlparse(page).path
    url = urlparse(page).netloc + urlparse(page).path # Remove HTTP/HTTPS to combine datasets later
    data = [domain,url,path,da,ss]
    if data not in dataset_moz_extracted and domain != '':
        dataset_moz_extracted.append(data)

# Extracting the domains for Console dataset
dataset_console_domain = []

for row in dataset_console:
    domain = row[0]
    if domain not in dataset_console_domain:
        dataset_console_domain.append([domain])

# Extracting the paths for the posts sitemap list
for row in list_sitemap:
    row[0] = urlparse(row[0]).path.replace('en/','').replace('/','')

print('Ahref dataset:')
pd.DataFrame(dataset_ahref_extracted,columns=['Domain','URL','Path','Domain Rating (DR)','Domain Traffic (DT)'])

Ahref dataset:


Unnamed: 0,Domain,URL,Path,Domain Rating (DR),Domain Traffic (DT)
0,thachpham.com,thachpham.com/wordpress/wordpress-tutorials/wo...,/wordpress/wordpress-tutorials/wordpress-la-gi...,67.0,35044
1,thachpham.com,thachpham.com/wordpress/wordpress-tutorials/se...,/wordpress/wordpress-tutorials/serie-hoc-wordp...,67.0,35044
2,thachpham.com,thachpham.com/,/,67.0,35044
3,thachpham.com,thachpham.com/wordpress/wordpress-tutorials/ca...,/wordpress/wordpress-tutorials/cai-wordpress-l...,67.0,35044
4,webhoanggia.com,webhoanggia.com/huong-da%CC%83n-cach-viet-chu-...,/huong-da%CC%83n-cach-viet-chu-in-da%CC%A3m-in...,25.0,1110
...,...,...,...,...,...
5860,couponsale.in,couponsale.in/search/loiix.net,/search/loiix.net,50.0,170
5861,thachpham.com,thachpham.com/wordpress/wp-plugin/tao-trang-ba...,/wordpress/wp-plugin/tao-trang-ban-san-pham-so...,67.0,35044
5862,thachpham.com,thachpham.com/wordpress/themes-wordpress/theme...,/wordpress/themes-wordpress/themes-gia-cao-nha...,67.0,35044
5863,thachpham.com,thachpham.com/seo/hieu-va-su-dung-tot-suc-manh...,/seo/hieu-va-su-dung-tot-suc-manh-cua-link-noi...,67.0,35044


In [7]:
print('MOZ dataset:')
pd.DataFrame(dataset_moz_extracted,columns=['Domain','URL','Path','Domain Authority (DA)','Spam Score (SS)'])

MOZ dataset:


Unnamed: 0,Domain,URL,Path,Domain Authority (DA),Spam Score (SS)
0,10k.pw,www.10k.pw/domain-list-511,/domain-list-511,3.0,96.0
1,100k.pw,www.100k.pw/domain-list-511,/domain-list-511,2.0,96.0
2,new.net.in,www.new.net.in/domain-list-58,/domain-list-58,20.0,88.0
3,one.net.in,www.one.net.in/domain-list-58,/domain-list-58,22.0,88.0
4,love.net.in,www.love.net.in/domain-list-58,/domain-list-58,6.0,87.0
...,...,...,...,...,...
7359,wordpress.com,minhhieuhcm.wordpress.com/category/di-chuyen/,/category/di-chuyen/,2.0,0.0
7360,wordpress.com,minhhieuhcm.wordpress.com/2016/12/21/huong-dan...,/2016/12/21/huong-dan-giai-bai-tap-dinh-gia-co...,2.0,0.0
7361,wordpress.com,minhhieuhcm.wordpress.com/category/phan-mem-va...,/category/phan-mem-van-phong/,2.0,0.0
7362,justanote.xyz,justanote.xyz/cach-xuat-san-pham-tren-wordpres...,/cach-xuat-san-pham-tren-wordpress-cho-merchan...,1.0,0.0


In [8]:
print('Console dataset:')
pd.DataFrame(dataset_console_domain,columns=['Domain'])

Console dataset:


Unnamed: 0,Domain
0,huuthuan.net
1,thachpham.com
2,medium.com
3,alotoi.com
4,linkhay.com
...,...
281,meohayplus.com
282,sketchfab.com
283,loginguides.net
284,logincollector.uk


In [9]:
print('Posts sitemap:')
pd.DataFrame(list_sitemap,columns=['Path'])

Posts sitemap:


Unnamed: 0,Path
0,how-to-add-watermark-to-photo
1,cach-tat-che-do-da-xem-va-thong-bao-tren-zalo
2,khang-trang-khong-duoc-phep-quang-cao
3,ban-tai-khoan-quang-cao-facebook
4,auto-phan-mem-ket-ban-facebook
...,...
195,cai-mail-cong-ty-vao-outlook-tren-dien-thoai-v...
196,gia-lap-may-tinh-casio-fx-580vn-plus-fx-570vn-...
197,co-nen-mua-ten-mien-vn
198,loi-lien-ket-ban-theo-doi-da-het-han


## Merging all datasets into one

### Step 1: Merging distinct domain, URL, Path
Ahref and MOZ datasets have the domain, URL, and Path. But Console dataset does not. So I'll merge Ahref and MOZ first and then Console.

In [10]:
# Merging Ahref and MOZ datasets
dataset_merged_domain_url_path = []
for row in dataset_ahref_extracted + dataset_moz_extracted:
    domain = row[0]
    url = row[1]
    path = row[2]
    data = [domain,url,path]
    if data not in dataset_merged_domain_url_path:
        dataset_merged_domain_url_path.append(data)

#Merging Console domain to merged Ahref and MOZ dataset
dataset_ahref_moz_domain = []
for row in dataset_merged_domain_url_path:
    domain = row[0]
    if domain not in dataset_ahref_moz_domain:
        dataset_ahref_moz_domain.append(domain)

for domain in dataset_console_domain:
    if domain not in dataset_ahref_moz_domain:
        dataset_merged_domain_url_path.append([domain[0],'None','None'])

# Sorting dataset_merged_domain_url_path by domain
dataset_merged_domain_url_path = sorted(dataset_merged_domain_url_path)

pd.DataFrame(dataset_merged_domain_url_path,columns=['Domain','URL','Path'])

Unnamed: 0,Domain,URL,Path
0,100k.pw,www.100k.pw/domain-list-511,/domain-list-511
1,10k.pw,www.10k.pw/domain-list-511,/domain-list-511
2,117vn.online,117vn.online/,/
3,12bor.com,12bor.com/vi-sao-luot-dang-ky-kenh-youtube-bi-...,/vi-sao-luot-dang-ky-kenh-youtube-bi-giam-khac...
4,157.245.224.202,,
...,...,...,...
10161,youandpie.com,youandpie.com/uu-dai-dac-biet/,/uu-dai-dac-biet/
10162,youtube.com,,
10163,zoacum.com,zoacum.com/website-list-821/,/website-list-821/
10164,zonealarm.com,,


### Step 2: Merging all data into one dataset
Now, I can merge all data into one dataset with columns: Domain | URL | Path | DR | DT | DA | SS | Disavow.

In [11]:
#Merging all data into one dataset
dataset_group = []
for row in dataset_merged_domain_url_path:
    domain = row[0]
    url = row[1]
    path = row[2]
    data = [domain,url,path,'None','None','None','None','domain:'+domain]

    # Get DR from Ahref
    for a_row in dataset_ahref_extracted:
        a_domain = a_row[0]
        a_dr = a_row[3]
        a_dt = a_row[4]
        if domain == a_domain:
            data[3] = a_dr
            data[4] = a_dt

    # Get DA and SS from MOZ
    for m_row in dataset_moz_extracted:
        m_domain = m_row[0]
        m_da = m_row[3]
        m_ss = m_row[4]
        if domain == m_domain:
            data[5] = m_da
            data[6] = m_ss

    dataset_group.append(data)

pd.DataFrame(dataset_group,columns=['Domain','URL','Path','DR','DT','DA','SS','Disavow'])

Unnamed: 0,Domain,URL,Path,DR,DT,DA,SS,Disavow
0,100k.pw,www.100k.pw/domain-list-511,/domain-list-511,0.5,0,2.0,96.0,domain:100k.pw
1,10k.pw,www.10k.pw/domain-list-511,/domain-list-511,0.3,0,3.0,96.0,domain:10k.pw
2,117vn.online,117vn.online/,/,0.0,0,,,domain:117vn.online
3,12bor.com,12bor.com/vi-sao-luot-dang-ky-kenh-youtube-bi-...,/vi-sao-luot-dang-ky-kenh-youtube-bi-giam-khac...,,,3.0,35.0,domain:12bor.com
4,157.245.224.202,,,,,,,domain:157.245.224.202
...,...,...,...,...,...,...,...,...
10161,youandpie.com,youandpie.com/uu-dai-dac-biet/,/uu-dai-dac-biet/,0.0,0,,,domain:youandpie.com
10162,youtube.com,,,,,,,domain:youtube.com
10163,zoacum.com,zoacum.com/website-list-821/,/website-list-821/,61.0,1,59.0,12.0,domain:zoacum.com
10164,zonealarm.com,,,,,,,domain:zonealarm.com


I feel satisfied with this complete dataset. Now I can use it with many conditions to extract to Disavow list.

## Extracting bad domains to Disavow list and DMCA report list

**Conditions for defining a bad domain:**
- Domain and path not in bad keywords, path not contain my post's path
- Domain Authority (DA) > 15 and Spam Score (SS) < DA*20% < 10%
- Domain Rating (DR) > 30 and Domain Traffic (DT) > 3000

**Conditions for defining a URL infringing content:**
- The path matches my post's path.

### Step 1: Creating some functions for conditions
- is_bad_domain: The domain contains a bad keyword.
- is_bad_path: The path contains a bad key keyword.
- is_copied_path: The path matches my post's path.

In [12]:
## Checking if the domain contains a bad keyword
def is_bad_domain(domain):
    for row in list_key_domain:
        keyword = row[0]
        if '.' in keyword: # Checking suffix exactly ('.in' keyword will not match '.info' suffix)
            if tldextract.extract(domain).suffix in keyword:
                return True
        elif keyword in domain:
            return True
    return False

## Checking if the path contains a bad keyword
def is_bad_path(path):
    for row in list_key_path:
        keyword = row[0]
        if keyword in path:
            return True
    return False

## Checking if the path matches my post's path
def is_copied_path(path):
    for row in list_sitemap:
        post = row[0]
        if post in path:
            return True
    return False

print('Bad domain test:',is_bad_domain('abc.pw'))
print('Bad path test:',is_bad_path('domain-list.html'))
print('Matches path test:',is_copied_path('/digital-marketing/chuyen-doi-tawk-to.html'))

Bad domain test: True
Bad path test: True
Matches path test: True


### Step 2: Extracting bad domains to Disavow list, DMCA report list, and Manual Check list

In [13]:
# Extracting bad domains to Disavow list, DMCA report list, and Manual Check list
disavow = []
dmca_report = []
manual_check = []

for row in dataset_group:
    domain = row[0]
    path = row[2]
    dr = row[3]
    dt = row[4]
    da = row[5]
    ss = row[6]
    dis = [row[7],dr,dt,da,ss] #Removing URL and Path for Disavow list

    #Checking if domain in white list
    if [domain] in list_whitelist:
        pass

    #Checking if path matches my post's path
    elif is_copied_path(path):
        disavow.append(dis)
        dmca_report.append(row)

    #Checking if domain and path contain a bad keyword
    elif not is_bad_domain(domain) and not is_bad_path(path):
        if dr != 'None' and dr > 30 and dt != 'None' and dt > 3000:
            manual_check.append(row)
        elif da != 'None' and da > 15 and ss != 'None' and ss < da*0.2 and ss < 0.1:
            manual_check.append(row)
        else:
            disavow.append(dis)
    else:
        disavow.append(dis)


#Removing duplicated rows for Disavow list
disavow_distinct = pd.DataFrame(disavow).drop_duplicates().values.tolist()

header_disavow = ['Disavow domain','DR','DT','DA','SS']
header_full = ['Domain','URL','Path','DR','DT','DA','SS','Disavow']

print('Disavow list:')
pd.DataFrame(disavow_distinct,columns=header_disavow)

Disavow list:


Unnamed: 0,Disavow domain,DR,DT,DA,SS
0,domain:100k.pw,0.5,0,2.0,96.0
1,domain:10k.pw,0.3,0,3.0,96.0
2,domain:117vn.online,0.0,0,,
3,domain:12bor.com,,,3.0,35.0
4,domain:157.245.224.202,,,,
...,...,...,...,...,...
944,domain:yoolakids.com,23.0,0,,
945,domain:youandpie.com,0.0,0,,
946,domain:zoacum.com,61.0,1,59.0,12.0
947,domain:zonealarm.com,,,,


Now I'll send the Disavow Domain column to Google Search Console. It is distinct (one line per domain).

In [14]:
print('DMCA report list:')
pd.DataFrame(dmca_report,columns=header_full)

DMCA report list:


Unnamed: 0,Domain,URL,Path,DR,DT,DA,SS,Disavow
0,1900baby.com,www.1900baby.com/cach-ngan-chan-danh-cap-noi-d...,/cach-ngan-chan-danh-cap-noi-dung-website-web-...,1.9,0,,,domain:1900baby.com
1,1gia.vn,1gia.vn/cach-ngan-chan-danh-cap-noi-dung-websi...,/cach-ngan-chan-danh-cap-noi-dung-website-web-...,23.0,0,,,domain:1gia.vn
2,2048.click,2048.click/how-to-add-google-analytics-to-webs...,/how-to-add-google-analytics-to-website/,,,3.0,38.0,domain:2048.click
3,25giay.vn,25giay.vn/digital-marketing/chuyen-doi-tawk-to...,/digital-marketing/chuyen-doi-tawk-to.html,11.0,41343,30.0,2.0,domain:25giay.vn
4,360baby.vn,360baby.vn/cach-ngan-chan-danh-cap-noi-dung-we...,/cach-ngan-chan-danh-cap-noi-dung-website-web-...,26.0,0,,,domain:360baby.vn
...,...,...,...,...,...,...,...,...
452,webgiasi.vn,webgiasi.vn/cach-cai-gmail-vao-outlook-website...,/cach-cai-gmail-vao-outlook-website-chia-se-nh...,11.0,6667,,,domain:webgiasi.vn
453,winbokids.vn,winbokids.vn/cach-ngan-chan-danh-cap-noi-dung-...,/cach-ngan-chan-danh-cap-noi-dung-website-web-...,23.0,0,,,domain:winbokids.vn
454,wsg.vn,wsg.vn/nhan-mo-tai-khoan-quang-cao-bi-vo-hieu-...,/nhan-mo-tai-khoan-quang-cao-bi-vo-hieu-hoa-bi...,25.0,4,4.0,28.0,domain:wsg.vn
455,yaamedia.vn,yaamedia.vn/tracking-chuyen-doi-tawk-to-bang-g...,/tracking-chuyen-doi-tawk-to-bang-google-tag-m...,0.0,0,,,domain:yaamedia.vn


I need to check if these URLs are infringing on my content. If true, I'll send it to Google Copyright Removal.

In [15]:
manual_check_distinct = len(pd.DataFrame(manual_check,columns=header_full).drop(['URL','Path','DR','DA','SS','Disavow'], axis=1).drop_duplicates().reset_index().values.tolist())

print('Manual check list: only',manual_check_distinct,'distinct domains')
pd.DataFrame(manual_check,columns=header_full)

Manual check list: only 15 distinct domains


Unnamed: 0,Domain,URL,Path,DR,DT,DA,SS,Disavow
0,1ty.vn,1ty.vn/Phai-Lam-Gi-Khi-May-Quet-Adobe-Khong-Ho...,/Phai-Lam-Gi-Khi-May-Quet-Adobe-Khong-Ho-tro-c...,43.0,40043,,,domain:1ty.vn
1,aol.com,search.aol.com/reviews,/reviews,91.0,32461120,,,domain:aol.com
2,bingparachute.com,s.bingparachute.com/search,/search,1.1,0,23.0,0.0,domain:bingparachute.com
3,bloggers.jp,"ping.bloggers.jp/index.php/10,%208-1/article/1...","/index.php/10,%208-1/article/13/BestFiles.org",,,29.0,0.0,domain:bloggers.jp
4,bloggers.jp,"ping.bloggers.jp/index.php/10,%208-1/article/a...","/index.php/10,%208-1/article/automata.cc/https...",,,29.0,0.0,domain:bloggers.jp
...,...,...,...,...,...,...,...,...
171,sitelike.org,www.sitelike.org/similar/vltoolkit.com/,/similar/vltoolkit.com/,64.0,8695329,,,domain:sitelike.org
172,sitelike.org,www.sitelike.org/similar/webico.vn/,/similar/webico.vn/,64.0,8695329,,,domain:sitelike.org
173,sitelike.org,www.sitelike.org/similar/wiki19.com/,/similar/wiki19.com/,64.0,8695329,,,domain:sitelike.org
174,sitelike.org,www.sitelike.org/similar/yaytext.com/,/similar/yaytext.com/,64.0,8695329,,,domain:sitelike.org


Because some domains do have DA,DT, DR, SS, and Path, they do not match any conditions. I'll check it manually with Ahref Quick Batch Analysis to find their DA and add bad backlinks to Disavow list later.

## Saving three final lists to Excel files

In [16]:
#Saving three final lists to Excel files
pd.DataFrame(disavow_distinct).to_excel('Output/Disavow.xlsx',index=None,header=header_disavow)
pd.DataFrame(manual_check).to_excel('Output/ManualCheck.xlsx',index=None,header=header_full)
pd.DataFrame(dmca_report).to_excel('Output/DmcaReport.xlsx',index=None,header=header_full)

## Conclusion
I have gotten Disavow list, DMCA Report list, and Manual Check list. These lists can help me prevent bad backlinks attacks. I need to do this work once a week to protect my website rank on Google Search.